
[v3,2/2] VMX: reset the segment cache after segment initialization in vmx_vcpu_reset

Message ID 20240725175232.337266-3-mlevitsk@redhat.com (mailing list archive)
State New, archived
Series: Fix for a very old KVM bug in the segment cache

Commit Message

Maxim Levitsky July 25, 2024, 5:52 p.m. UTC
Reset the segment cache after segment initialization in vmx_vcpu_reset()
to avoid stale, uninitialized data being left in the segment cache.

In particular, the following scenario is possible when full preemption
is enabled:

- The vCPU is just created, and the vCPU thread is preempted before SS.AR_BYTES
is written in vmx_vcpu_reset().

- During preemption, kvm_arch_vcpu_in_kernel() is called; it reads SS's AR byte
to determine whether the vCPU was running in the kernel.

That caches a value of 0 for SS.AR_BYTES. When the vCPU thread is eventually
scheduled back in, it writes the correct SS.AR_BYTES value to the VMCS, but the
cached value remains stale and can be read, e.g. via KVM_GET_SREGS.

Usually this is not a problem, because the VMX segment cache is reset on each
vCPU run. However, if userspace (as the KVM selftests do) reads the segment
registers just after the vCPU was created, modifies some of them, and passes
the others back unchanged (SS.AR_BYTES in this case), the stale value makes it
into the VMCS and later leads to a VM entry failure due to an incorrect SS
segment type.
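For illustration, the triggering userspace sequence boils down to something
like the following hypothetical sketch using raw KVM ioctls (not the actual
selftest code):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical reproducer sketch: read the segment registers immediately
 * after vCPU creation and write them straight back, leaving SS untouched.
 * If the earlier read raced with vmx_vcpu_reset() as described above, the
 * stale cached SS.AR_BYTES is written back into the VMCS and the next
 * KVM_RUN fails VM entry with invalid guest state.
 */
static void write_back_sregs(int vcpu_fd)
{
	struct kvm_sregs sregs;

	ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);	/* may see stale SS.AR */
	/* ... modify some registers, but leave sregs.ss as-is ... */
	ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);	/* stale value written back */
}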

Fix this by moving the vmx_segment_cache_clear() call to be after the
segments are initialized.

Note that this still doesn't fix the issue of kvm_arch_vcpu_in_kernel
getting stale data during the segment setup, and that issue will
be addressed later.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Chao Gao Aug. 6, 2024, 1:18 a.m. UTC | #1
On Thu, Jul 25, 2024 at 01:52:32PM -0400, Maxim Levitsky wrote:

KVM: VMX:

>reset the segment cache after segment initialization in vmx_vcpu_reset
>to avoid stale uninitialized data being cached in the segment cache.
>
>In particular the following scenario is possible when full preemption
>is enabled:
>
>- vCPU is just created, and the vCPU thread is preempted before SS.AR_BYTES
>is written in vmx_vcpu_reset.
>
>- During preemption, the kvm_arch_vcpu_in_kernel is called which
>reads SS's segment AR byte to determine if the CPU was in the kernel.
>
>That caches 0 value of SS.AR_BYTES, then eventually the vCPU thread will be
>preempted back, then set the correct SS.AR_BYTES value in the vmcs
>and the cached value will remain stale, and could be read e.g via
>KVM_GET_SREGS.
>
>Usually this is not a problem because VMX segment cache is reset on each
>vCPU run, but if the userspace (e.g KVM selftests do) reads the segment
>registers just after the vCPU was created, and modifies some of them
>but passes through other registers and in this case SS.AR_BYTES,
>the stale value of it will make it into the vmcs,
>and later lead to a VM entry failure due to incorrect SS segment type.

I looked into the same issue last week, which was reported by someone
internally.

>
>Fix this by moving the vmx_segment_cache_clear() call to be after the
>segments are initialized.
>
>Note that this still doesn't fix the issue of kvm_arch_vcpu_in_kernel
>getting stale data during the segment setup, and that issue will
>be addressed later.
>
>Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>

Do you need a Fixes tag and/or Cc: stable?

>---
> arch/x86/kvm/vmx/vmx.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index fa9f307d9b18..d43bb755e15c 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -4870,9 +4870,6 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> 	vmx->hv_deadline_tsc = -1;
> 	kvm_set_cr8(vcpu, 0);
> 
>-	vmx_segment_cache_clear(vmx);
>-	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
>-
> 	seg_setup(VCPU_SREG_CS);
> 	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> 	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
>@@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> 	vmcs_writel(GUEST_IDTR_BASE, 0);
> 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> 
>+	vmx_segment_cache_clear(vmx);
>+	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);

vmx_segment_cache_clear() is called at a few other sites. I think at least the
call in __vmx_set_segment() should be fixed, because QEMU may read SS.AR right
after a write to it. If the write was preempted after the cache was cleared but
before the new value was written into the VMCS, QEMU would find that SS.AR held
a stale value.
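Roughly, the ordering in question (an SS-only, simplified sketch; the real
__vmx_set_segment() is generic over all segments and also handles
vm86/real-mode quirks):

/*
 * Simplified, SS-only sketch.  Clearing the cache before the VMCS writes
 * opens a window in which a preempting reader can repopulate the cache
 * with the old values; moving the clear to after the writes (as this
 * patch does for vmx_vcpu_reset()) would close that window.
 */
static void vmx_set_ss_sketch(struct kvm_vcpu *vcpu, struct kvm_segment *var)
{
	vmx_segment_cache_clear(to_vmx(vcpu));		/* window opens here */

	vmcs_write16(GUEST_SS_SELECTOR, var->selector);
	vmcs_writel(GUEST_SS_BASE, var->base);
	vmcs_write32(GUEST_SS_LIMIT, var->limit);
	vmcs_write32(GUEST_SS_AR_BYTES, vmx_segment_access_rights(var));
}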

>+
> 	vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
> 	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
> 	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, 0);
>-- 
>2.26.3
>
>
Sean Christopherson Aug. 9, 2024, 3:27 p.m. UTC | #2
On Tue, Aug 06, 2024, Chao Gao wrote:
> On Thu, Jul 25, 2024 at 01:52:32PM -0400, Maxim Levitsky wrote:
> >Fix this by moving the vmx_segment_cache_clear() call to be after the
> >segments are initialized.
> >
> >Note that this still doesn't fix the issue of kvm_arch_vcpu_in_kernel
> >getting stale data during the segment setup, and that issue will
> >be addressed later.
> >
> >Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Do you need a Fixes tag and/or Cc: stable?

Heh, it's an old one

  Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields")

> 
> >---
> > arch/x86/kvm/vmx/vmx.c | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >index fa9f307d9b18..d43bb755e15c 100644
> >--- a/arch/x86/kvm/vmx/vmx.c
> >+++ b/arch/x86/kvm/vmx/vmx.c
> >@@ -4870,9 +4870,6 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > 	vmx->hv_deadline_tsc = -1;
> > 	kvm_set_cr8(vcpu, 0);
> > 
> >-	vmx_segment_cache_clear(vmx);
> >-	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> >-
> > 	seg_setup(VCPU_SREG_CS);
> > 	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> > 	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
> >@@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > 	vmcs_writel(GUEST_IDTR_BASE, 0);
> > 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> > 
> >+	vmx_segment_cache_clear(vmx);
> >+	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> 
> vmx_segment_cache_clear() is called in a few other sites. I think at least the
> call in __vmx_set_segment() should be fixed, because QEMU may read SS.AR right
> after a write to it. if the write was preempted after the cache was cleared but
> before the new value being written into VMCS, QEMU would find that SS.AR held a
> stale value.

Ya, I thought the plan was to go for a more complete fix[*]?  This change isn't
wrong, but it's obviously incomplete, and will be unnecessary if the preemption
issue is resolved.

[*] https://lore.kernel.org/all/f183d215c903d4d1e85bf89e9d8b57dd6ce5c175.camel@redhat.com
Maxim Levitsky Sept. 9, 2024, 7:11 p.m. UTC | #3
On Fri, 2024-08-09 at 08:27 -0700, Sean Christopherson wrote:
> On Tue, Aug 06, 2024, Chao Gao wrote:
> > On Thu, Jul 25, 2024 at 01:52:32PM -0400, Maxim Levitsky wrote:
> > > Fix this by moving the vmx_segment_cache_clear() call to be after the
> > > segments are initialized.
> > > 
> > > Note that this still doesn't fix the issue of kvm_arch_vcpu_in_kernel
> > > getting stale data during the segment setup, and that issue will
> > > be addressed later.
> > > 
> > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > 
> > Do you need a Fixes tag and/or Cc: stable?
> 
> Heh, it's an old one
> 
>   Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields")
> 
> > > ---
> > > arch/x86/kvm/vmx/vmx.c | 6 +++---
> > > 1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index fa9f307d9b18..d43bb755e15c 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -4870,9 +4870,6 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > 	vmx->hv_deadline_tsc = -1;
> > > 	kvm_set_cr8(vcpu, 0);
> > > 
> > > -	vmx_segment_cache_clear(vmx);
> > > -	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > > -
> > > 	seg_setup(VCPU_SREG_CS);
> > > 	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> > > 	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
> > > @@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > 	vmcs_writel(GUEST_IDTR_BASE, 0);
> > > 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> > > 
> > > +	vmx_segment_cache_clear(vmx);
> > > +	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > 
> > vmx_segment_cache_clear() is called in a few other sites. I think at least the
> > call in __vmx_set_segment() should be fixed, because QEMU may read SS.AR right
> > after a write to it. if the write was preempted after the cache was cleared but
> > before the new value being written into VMCS, QEMU would find that SS.AR held a
> > stale value.
> 
> Ya, I thought the plan was to go for a more complete fix[*]?  This change isn't
> wrong, but it's obviously incomplete, and will be unnecessary if the preemption
> issue is resolved.

Hi,

I was thinking to keep it simple, since the issue is mostly theoretical after this fix,
but I'll give this another try.

Best regards,
	Maxim Levitsky

> 
> [*] https://lore.kernel.org/all/f183d215c903d4d1e85bf89e9d8b57dd6ce5c175.camel@redhat.com
>
Maxim Levitsky Sept. 10, 2024, 1:07 a.m. UTC | #4
On Mon, 2024-09-09 at 15:11 -0400, Maxim Levitsky wrote:
> On Fri, 2024-08-09 at 08:27 -0700, Sean Christopherson wrote:
> > On Tue, Aug 06, 2024, Chao Gao wrote:
> > > On Thu, Jul 25, 2024 at 01:52:32PM -0400, Maxim Levitsky wrote:
> > > > Fix this by moving the vmx_segment_cache_clear() call to be after the
> > > > segments are initialized.
> > > > 
> > > > Note that this still doesn't fix the issue of kvm_arch_vcpu_in_kernel
> > > > getting stale data during the segment setup, and that issue will
> > > > be addressed later.
> > > > 
> > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > 
> > > Do you need a Fixes tag and/or Cc: stable?
> > 
> > Heh, it's an old one
> > 
> >   Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields")
> > 
> > > > ---
> > > > arch/x86/kvm/vmx/vmx.c | 6 +++---
> > > > 1 file changed, 3 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index fa9f307d9b18..d43bb755e15c 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -4870,9 +4870,6 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > 	vmx->hv_deadline_tsc = -1;
> > > > 	kvm_set_cr8(vcpu, 0);
> > > > 
> > > > -	vmx_segment_cache_clear(vmx);
> > > > -	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > > > -
> > > > 	seg_setup(VCPU_SREG_CS);
> > > > 	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> > > > 	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
> > > > @@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > 	vmcs_writel(GUEST_IDTR_BASE, 0);
> > > > 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> > > > 
> > > > +	vmx_segment_cache_clear(vmx);
> > > > +	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > > 
> > > vmx_segment_cache_clear() is called in a few other sites. I think at least the
> > > call in __vmx_set_segment() should be fixed, because QEMU may read SS.AR right
> > > after a write to it. if the write was preempted after the cache was cleared but
> > > before the new value being written into VMCS, QEMU would find that SS.AR held a
> > > stale value.
> > 
> > Ya, I thought the plan was to go for a more complete fix[*]?  This change isn't
> > wrong, but it's obviously incomplete, and will be unnecessary if the preemption
> > issue is resolved.
> 
> Hi,
> 
> I was thinking to keep it simple, since the issue is mostly theoretical after this fix,
> but I'll give this another try.
> 
> Best regards,
> 	Maxim Levitsky
> 
> > [*] https://lore.kernel.org/all/f183d215c903d4d1e85bf89e9d8b57dd6ce5c175.camel@redhat.com
> > 

Hi,

This is what I am thinking, after going over this issue again:

Pre-populating the cache and/or adding 'exited_in_kernel' will waste vmreads on *each* vmexit,
and I worry that this is just not worth it for an issue that is mostly theoretical.


Since the segment and register caches only optimize the case of reading the same field twice
or more, I suspect that always reading these fields is worse, performance-wise, than removing
the segment cache altogether and re-reading the fields each time.


Finally, all 3 places that read the segment cache access only one piece of data (SS.AR or RIP),
so it doesn't really matter whether they see an old or a new value.

Even if userspace changes SS's AR bytes out of the blue and we then get a preemption event,
the old value is, as you say, technically the correct one, but in practice it really
doesn't matter.

So IMHO, just ensuring that we invalidate the segment cache right after we do any changes is the simplest
solution.

In addition to that, I can add a warning to kvm_register_is_available and vmx_segment_cache_test_set
that checks that only SS.AR and RIP are read from interrupt context, so that if someone attempts
to read more fields in the future, this issue can be re-evaluated.

Best regards,
	Maxim Levitsky
Sean Christopherson Sept. 25, 2024, 3:08 p.m. UTC | #5
On Mon, Sep 09, 2024, Maxim Levitsky wrote:
> On Mon, 2024-09-09 at 15:11 -0400, Maxim Levitsky wrote:
> > On Fri, 2024-08-09 at 08:27 -0700, Sean Christopherson wrote:
> > > > > @@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > > 	vmcs_writel(GUEST_IDTR_BASE, 0);
> > > > > 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> > > > > 
> > > > > +	vmx_segment_cache_clear(vmx);
> > > > > +	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > > > 
> > > > vmx_segment_cache_clear() is called in a few other sites. I think at least the
> > > > call in __vmx_set_segment() should be fixed, because QEMU may read SS.AR right
> > > > after a write to it. if the write was preempted after the cache was cleared but
> > > > before the new value being written into VMCS, QEMU would find that SS.AR held a
> > > > stale value.
> > > 
> > > Ya, I thought the plan was to go for a more complete fix[*]?  This change isn't
> > > wrong, but it's obviously incomplete, and will be unnecessary if the preemption
> > > issue is resolved.
> > 
> > Hi,
> > 
> > I was thinking to keep it simple, since the issue is mostly theoretical
> > after this fix, but I'll give this another try.
>
> This is what I am thinking, after going over this issue again:
> 
> Pre-populating the cache and/or adding 'exited_in_kernel' will waste vmreads
> on *each* vmexit,

FWIW, KVM would only need to do the VMREAD on non-fastpath exits.

> I worry that this is just not worth the mostly theoretical issue that we
> have.

Yeah.  And cost aside, it's weird and hard to document and use properly because
it's such an edge case.  E.g. it only applies to preemptible kernels, and use of
exited_in_kernel would likely need to be restricted to the sched_out() preempt
logic, because anything else really needs to check the "current" CPL.

> Since the segment and the register cache only optimize the case of reading a
> same field twice or more, I suspect that reading these fields always is worse
> performance wise than removing the segment cache altogether and reading these
> fields again and again.

For modern setups, yeah, the segment cache likely isn't helping much, though I
suspect it still gets a decent number of "hits" on CS.AR_BYTES via is_64_bit_mode().

But for older CPUs where KVM needs to emulate large chunks of code, I'm betting
the segment cache is an absolute must have.

> Finally all 3 places that read the segment cache, only access one piece of
> data (SS.AR or RIP), thus it doesn't really matter if they see an old or a
> new value. 
> 
> I mean in theory if userspace changes the SS's AR bytes out of the blue, and
> then we get a preemption event, in theory as you say the old value is correct
> but it really doesn't matter.
> 
> So IMHO, just ensuring that we invalidate the segment cache right after we do
> any changes is the simplest solution.

But it's not a very maintainable solution.  It fixes the immediate problem, but
doesn't do anything to help ensure that all future code invalidates the cache
after writing, nor does it guarantee that all future usage of SS.AR can tolerate
consuming stale values.

> I can in addition to that add a warning to kvm_register_is_available and
> vmx_segment_cache_test_set, that will test that only SS.AR and RIP are read
> from the interrupt context, so that if in the future someone attempts to read
> more fields, this issue can be re-evaluated.

There's no need to add anything to vmx_segment_cache_test_set(), because it uses
kvm_register_is_available().  I.e. adding logic in kvm_register_is_available()
will suffice.

If we explicitly allow VMCS accesses from PMI callbacks, which we *know* can
tolerate stale data _and_ never run while KVM is updating segments, then we can
fix the preemption case by forcing a VMREAD and bypassing the cache.
 
And looking to the future, if vcpu->arch.guest_state_protected is moved/exposed
to common code in some way, then the common PMI code can skip trying to read guest
state, and the ugliness of open coding that check in the preemption path largely
goes away.

If you're ok with the idea, I'll write changelogs and post the below (probably over
two patches).  I don't love adding another kvm_x86_ops callback, but I couldn't
come up with anything less ugly.

---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/kvm/kvm_cache_regs.h      | 17 +++++++++++++++++
 arch/x86/kvm/svm/svm.c             |  1 +
 arch/x86/kvm/vmx/main.c            |  1 +
 arch/x86/kvm/vmx/vmx.c             | 23 ++++++++++++++++++-----
 arch/x86/kvm/vmx/vmx.h             |  1 +
 arch/x86/kvm/x86.c                 | 13 ++++++++++++-
 8 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 861d080ed4c6..5aff7222e40f 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -34,6 +34,7 @@ KVM_X86_OP(set_msr)
 KVM_X86_OP(get_segment_base)
 KVM_X86_OP(get_segment)
 KVM_X86_OP(get_cpl)
+KVM_X86_OP(get_cpl_no_cache)
 KVM_X86_OP(set_segment)
 KVM_X86_OP(get_cs_db_l_bits)
 KVM_X86_OP(is_valid_cr0)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9..3ae90df0a177 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1656,6 +1656,7 @@ struct kvm_x86_ops {
 	void (*get_segment)(struct kvm_vcpu *vcpu,
 			    struct kvm_segment *var, int seg);
 	int (*get_cpl)(struct kvm_vcpu *vcpu);
+	int (*get_cpl_no_cache)(struct kvm_vcpu *vcpu);
 	void (*set_segment)(struct kvm_vcpu *vcpu,
 			    struct kvm_segment *var, int seg);
 	void (*get_cs_db_l_bits)(struct kvm_vcpu *vcpu, int *db, int *l);
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index b1eb46e26b2e..0370483003f6 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -43,6 +43,18 @@ BUILD_KVM_GPR_ACCESSORS(r14, R14)
 BUILD_KVM_GPR_ACCESSORS(r15, R15)
 #endif
 
+/*
+ * Using the register cache from interrupt context is generally not allowed, as
+ * caching a register and marking it available/dirty can't be done atomically,
+ * i.e. accesses from interrupt context may clobber state or read stale data if
+ * the vCPU task is in the process of updating the cache.  The exception is if
+ * KVM is handling an IRQ/NMI from guest mode, as that bounded sequence doesn't
+ * touch the cache, it runs after the cache is reset (post VM-Exit), and PMIs
+ * need to read several registers that are cacheable.
+ */
+#define kvm_assert_register_caching_allowed(vcpu)		\
+	lockdep_assert_once(in_task() ||			\
+			    READ_ONCE(vcpu->arch.handling_intr_from_guest))
 /*
  * avail  dirty
  * 0	  0	  register in VMCS/VMCB
@@ -53,24 +65,28 @@ BUILD_KVM_GPR_ACCESSORS(r15, R15)
 static inline bool kvm_register_is_available(struct kvm_vcpu *vcpu,
 					     enum kvm_reg reg)
 {
+	kvm_assert_register_caching_allowed(vcpu);
 	return test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
 }
 
 static inline bool kvm_register_is_dirty(struct kvm_vcpu *vcpu,
 					 enum kvm_reg reg)
 {
+	kvm_assert_register_caching_allowed(vcpu);
 	return test_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
 }
 
 static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
 					       enum kvm_reg reg)
 {
+	kvm_assert_register_caching_allowed(vcpu);
 	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
 }
 
 static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
 					   enum kvm_reg reg)
 {
+	kvm_assert_register_caching_allowed(vcpu);
 	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
 	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
 }
@@ -84,6 +100,7 @@ static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
 static __always_inline bool kvm_register_test_and_mark_available(struct kvm_vcpu *vcpu,
 								 enum kvm_reg reg)
 {
+	kvm_assert_register_caching_allowed(vcpu);
 	return arch___test_and_set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
 }
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9df3e1e5ae81..50f6b0e03d04 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5031,6 +5031,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.get_segment = svm_get_segment,
 	.set_segment = svm_set_segment,
 	.get_cpl = svm_get_cpl,
+	.get_cpl_no_cache = svm_get_cpl,
 	.get_cs_db_l_bits = svm_get_cs_db_l_bits,
 	.is_valid_cr0 = svm_is_valid_cr0,
 	.set_cr0 = svm_set_cr0,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7668e2fb8043..92d35cc6cd15 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -50,6 +50,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.get_segment = vmx_get_segment,
 	.set_segment = vmx_set_segment,
 	.get_cpl = vmx_get_cpl,
+	.get_cpl_no_cache = vmx_get_cpl_no_cache,
 	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
 	.is_valid_cr0 = vmx_is_valid_cr0,
 	.set_cr0 = vmx_set_cr0,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c67e448c6ebd..e2483678eca1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3568,16 +3568,29 @@ u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
 	return vmx_read_guest_seg_base(to_vmx(vcpu), seg);
 }
 
-int vmx_get_cpl(struct kvm_vcpu *vcpu)
+static int __vmx_get_cpl(struct kvm_vcpu *vcpu, bool no_cache)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int ar;
 
 	if (unlikely(vmx->rmode.vm86_active))
 		return 0;
-	else {
-		int ar = vmx_read_guest_seg_ar(vmx, VCPU_SREG_SS);
-		return VMX_AR_DPL(ar);
-	}
+
+	if (no_cache)
+		ar = vmcs_read32(GUEST_SS_AR_BYTES);
+	else
+		ar = vmx_read_guest_seg_ar(vmx, VCPU_SREG_SS);
+	return VMX_AR_DPL(ar);
+}
+
+int vmx_get_cpl(struct kvm_vcpu *vcpu)
+{
+	return __vmx_get_cpl(vcpu, false);
+}
+
+int vmx_get_cpl_no_cache(struct kvm_vcpu *vcpu)
+{
+	return __vmx_get_cpl(vcpu, true);
 }
 
 static u32 vmx_segment_access_rights(struct kvm_segment *var)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 2325f773a20b..bcf40c7f3a38 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -385,6 +385,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
 void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
 			unsigned long fs_base, unsigned long gs_base);
 int vmx_get_cpl(struct kvm_vcpu *vcpu);
+int vmx_get_cpl_no_cache(struct kvm_vcpu *vcpu);
 bool vmx_emulation_required(struct kvm_vcpu *vcpu);
 unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
 void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83fe0a78146f..941245082647 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5094,7 +5094,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	int idx;
 
 	if (vcpu->preempted) {
-		vcpu->arch.preempted_in_kernel = kvm_arch_vcpu_in_kernel(vcpu);
+		/*
+		 * Assume protected guests are in-kernel.  Inefficient yielding
+		 * due to false positives is preferable to never yielding due
+		 * to false negatives.
+		 */
+		vcpu->arch.preempted_in_kernel = vcpu->arch.guest_state_protected ||
+						 !kvm_x86_call(get_cpl_no_cache)(vcpu);
 
 		/*
 		 * Take the srcu lock as memslots will be accessed to check the gfn
@@ -13207,6 +13213,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
 {
+	/* TODO: Elide the call in the kvm_guest_get_ip() perf callback. */
 	if (vcpu->arch.guest_state_protected)
 		return true;
 
@@ -13215,6 +13222,10 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
 
 unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
 {
+	/* TODO: Elide the call in the kvm_guest_get_ip() perf callback. */
+	if (vcpu->arch.guest_state_protected)
+		return 0;
+
 	return kvm_rip_read(vcpu);
 }
 

base-commit: 3f8df6285271d9d8f17d733433e5213a63b83a0b
--
Maxim Levitsky Sept. 25, 2024, 5:05 p.m. UTC | #6
On Wed, 2024-09-25 at 08:08 -0700, Sean Christopherson wrote:
> On Mon, Sep 09, 2024, Maxim Levitsky wrote:
> > On Mon, 2024-09-09 at 15:11 -0400, Maxim Levitsky wrote:
> > > On Fri, 2024-08-09 at 08:27 -0700, Sean Christopherson wrote:
> > > > > > @@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > > > 	vmcs_writel(GUEST_IDTR_BASE, 0);
> > > > > > 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> > > > > > 
> > > > > > +	vmx_segment_cache_clear(vmx);
> > > > > > +	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > > > > 
> > > > > vmx_segment_cache_clear() is called in a few other sites. I think at least the
> > > > > call in __vmx_set_segment() should be fixed, because QEMU may read SS.AR right
> > > > > after a write to it. if the write was preempted after the cache was cleared but
> > > > > before the new value being written into VMCS, QEMU would find that SS.AR held a
> > > > > stale value.
> > > > 
> > > > Ya, I thought the plan was to go for a more complete fix[*]?  This change isn't
> > > > wrong, but it's obviously incomplete, and will be unnecessary if the preemption
> > > > issue is resolved.
> > > 
> > > Hi,
> > > 
> > > I was thinking to keep it simple, since the issue is mostly theoretical
> > > after this fix, but I'll give this another try.
> > 
> > This is what I am thinking, after going over this issue again:
> > 
> > Pre-populating the cache and/or adding 'exited_in_kernel' will waste vmreads
> > on *each* vmexit,
> 
> FWIW, KVM would only need to do the VMREAD on non-fastpath exits.
> 
> > I worry that this is just not worth the mostly theoretical issue that we
> > have.
> 
> Yeah.  And cost aside, it's weird and hard to document and use properly because
> it's such an edge case.  E.g. only applies preemptible kernels, and use of
> exited_in_kernel would likely need to be restricted to the sched_out() preempt
> logic, because anything really needs to check the "current" CPL.
> 
> > Since the segment and the register cache only optimize the case of reading a
> > same field twice or more, I suspect that reading these fields always is worse
> > performance wise than removing the segment cache altogether and reading these
> > fields again and again.
> 
> For modern setups, yeah, the segment cache likely isn't helping much, though I
> suspect it still gets a decent number of "hits" on CS.AR_BYTES via is_64_bit_mode().
> 
> But for older CPUs where KVM needs to emulate large chunks of code, I'm betting
> the segment cache is an absolute must have.
> 
> > Finally all 3 places that read the segment cache, only access one piece of
> > data (SS.AR or RIP), thus it doesn't really matter if they see an old or a
> > new value. 
> > 
> > I mean in theory if userspace changes the SS's AR bytes out of the blue, and
> > then we get a preemption event, in theory as you say the old value is correct
> > but it really doesn't matter.
> > 
> > So IMHO, just ensuring that we invalidate the segment cache right after we do
> > any changes is the simplest solution.
> 
> But it's not a very maintainable solution.  It fixes the immediate problem, but
> doesn't do anything to help ensure that all future code invalidates the cache
> after writing,

If we wrap segment cache accesses with something like segment_cache_access_begin()/end(),
we can ensure that the segment cache is only modified inside such a section (perhaps even
enforced with macros), or at least make it clear to the developer that all writes should
be wrapped by these functions.
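
Roughly, something like this (purely illustrative; neither helper exists in
KVM today, the names are just the ones suggested above):

static inline void segment_cache_access_begin(struct vcpu_vmx *vmx)
{
	/* could host a lockdep/WARN hook to catch unbracketed writers */
}

static inline void segment_cache_access_end(struct vcpu_vmx *vmx)
{
	/* invalidate only after the new values are in the VMCS */
	vmx_segment_cache_clear(vmx);
}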

I also do think that we should still re-order the segment cache accesses in vmx_vcpu_reset()
and other places just for the sake of consistency.


>  nor does it guarantee that all future usage of SS.AR can tolerate
> consuming stale values.
> 
> > I can in addition to that add a warning to kvm_register_is_available and
> > vmx_segment_cache_test_set, that will test that only SS.AR and RIP are read
> > from the interrupt context, so that if in the future someone attempts to read
> > more fields, this issue can be re-evaluated.
> 
> There's no need to add anything to vmx_segment_cache_test_set(), because it uses
> kvm_register_is_available().  I.e. adding logic in kvm_register_is_available()
> will suffice.
> 
> If we explicitly allow VMCS accesses from PMI callbacks, which by we *know* can
> tolerate stale data _and_ never run while KVM is updating segments, then we can
> fix the preemption case by forcing a VMREAD and bypassing the cache.
>  
> And looking to the future, if vcpu->arch.guest_state_protected is moved/exposed
> to common code in some way, then the common PMI code can skip trying to read guest
> state, and the ugliness of open coding that check in the preemption path largely
> goes away.
This is assuming that most VMs will be protected in the future?

> 
> If you're ok with the idea, I'll write changelogs and post the below (probably over
> two patches).  I don't love adding another kvm_x86_ops callback, but I couldn't
> come up with anything less ugly.

This was one of the reasons I didn't want to write something like that.
If we indeed only add a callback for get_cpl_no_cache, then it is tolerable.


> 
> ---
>  arch/x86/include/asm/kvm-x86-ops.h |  1 +
>  arch/x86/include/asm/kvm_host.h    |  1 +
>  arch/x86/kvm/kvm_cache_regs.h      | 17 +++++++++++++++++
>  arch/x86/kvm/svm/svm.c             |  1 +
>  arch/x86/kvm/vmx/main.c            |  1 +
>  arch/x86/kvm/vmx/vmx.c             | 23 ++++++++++++++++++-----
>  arch/x86/kvm/vmx/vmx.h             |  1 +
>  arch/x86/kvm/x86.c                 | 13 ++++++++++++-
>  8 files changed, 52 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 861d080ed4c6..5aff7222e40f 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -34,6 +34,7 @@ KVM_X86_OP(set_msr)
>  KVM_X86_OP(get_segment_base)
>  KVM_X86_OP(get_segment)
>  KVM_X86_OP(get_cpl)
> +KVM_X86_OP(get_cpl_no_cache)
>  KVM_X86_OP(set_segment)
>  KVM_X86_OP(get_cs_db_l_bits)
>  KVM_X86_OP(is_valid_cr0)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6d9f763a7bb9..3ae90df0a177 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1656,6 +1656,7 @@ struct kvm_x86_ops {
>  	void (*get_segment)(struct kvm_vcpu *vcpu,
>  			    struct kvm_segment *var, int seg);
>  	int (*get_cpl)(struct kvm_vcpu *vcpu);
> +	int (*get_cpl_no_cache)(struct kvm_vcpu *vcpu);
>  	void (*set_segment)(struct kvm_vcpu *vcpu,
>  			    struct kvm_segment *var, int seg);
>  	void (*get_cs_db_l_bits)(struct kvm_vcpu *vcpu, int *db, int *l);
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index b1eb46e26b2e..0370483003f6 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -43,6 +43,18 @@ BUILD_KVM_GPR_ACCESSORS(r14, R14)
>  BUILD_KVM_GPR_ACCESSORS(r15, R15)
>  #endif
>  
> +/*
> + * Using the register cache from interrupt context is generally not allowed, as
> + * caching a register and marking it available/dirty can't be done atomically,
> + * i.e. accesses from interrupt context may clobber state or read stale data if
> + * the vCPU task is in the process of updating the cache.  The exception is if
> + * KVM is handling an IRQ/NMI from guest mode, as that bounded sequence doesn't
> + * touch the cache, it runs after the cache is reset (post VM-Exit), and PMIs
> + * need to several registers that are cacheable.
> + */
> +#define kvm_assert_register_caching_allowed(vcpu)		\
> +	lockdep_assert_once(in_task() ||			\
> +			    READ_ONCE(vcpu->arch.handling_intr_from_guest))

This is ugly, but on second thought reasonable, given the circumstances.

How about using kvm_arch_pmi_in_guest() instead? It is a tiny bit more accurate
and self-documenting IMHO.
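
I.e. something along these lines (sketch only; kvm_arch_pmi_in_guest()
already checks vcpu->arch.handling_intr_from_guest):

#define kvm_assert_register_caching_allowed(vcpu)		\
	lockdep_assert_once(in_task() ||			\
			    kvm_arch_pmi_in_guest(vcpu))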


Also, how about checking for in_task() in __vmx_get_cpl() and then avoiding the cache?
This way we will avoid adding a new callback, and in theory if there is more code that
tries to read CPL from interrupt context, it will work for free. 
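
For reference, that alternative would look roughly like this (sketch only,
not the posted patch):

static int __vmx_get_cpl_sketch(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	int ar;

	if (unlikely(vmx->rmode.vm86_active))
		return 0;

	/* bypass the cache for any read that isn't from task context */
	if (!in_task())
		ar = vmcs_read32(GUEST_SS_AR_BYTES);
	else
		ar = vmx_read_guest_seg_ar(vmx, VCPU_SREG_SS);
	return VMX_AR_DPL(ar);
}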

But on the other hand we might actually not want new code to get this for free. 
Is this the reason you added the callback?

>  /*
>   * avail  dirty
>   * 0	  0	  register in VMCS/VMCB
> @@ -53,24 +65,28 @@ BUILD_KVM_GPR_ACCESSORS(r15, R15)
>  static inline bool kvm_register_is_available(struct kvm_vcpu *vcpu,
>  					     enum kvm_reg reg)
>  {
> +	kvm_assert_register_caching_allowed(vcpu);
>  	return test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
>  }
>  
>  static inline bool kvm_register_is_dirty(struct kvm_vcpu *vcpu,
>  					 enum kvm_reg reg)
>  {
> +	kvm_assert_register_caching_allowed(vcpu);
>  	return test_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
>  }
>  
>  static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
>  					       enum kvm_reg reg)
>  {
> +	kvm_assert_register_caching_allowed(vcpu);
>  	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
>  }
>  
>  static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
>  					   enum kvm_reg reg)
>  {
> +	kvm_assert_register_caching_allowed(vcpu);
>  	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
>  	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
>  }
> @@ -84,6 +100,7 @@ static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
>  static __always_inline bool kvm_register_test_and_mark_available(struct kvm_vcpu *vcpu,
>  								 enum kvm_reg reg)
>  {
> +	kvm_assert_register_caching_allowed(vcpu);
>  	return arch___test_and_set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
>  }
>  
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 9df3e1e5ae81..50f6b0e03d04 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -5031,6 +5031,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>  	.get_segment = svm_get_segment,
>  	.set_segment = svm_set_segment,
>  	.get_cpl = svm_get_cpl,
> +	.get_cpl_no_cache = svm_get_cpl,
>  	.get_cs_db_l_bits = svm_get_cs_db_l_bits,
>  	.is_valid_cr0 = svm_is_valid_cr0,
>  	.set_cr0 = svm_set_cr0,
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 7668e2fb8043..92d35cc6cd15 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -50,6 +50,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.get_segment = vmx_get_segment,
>  	.set_segment = vmx_set_segment,
>  	.get_cpl = vmx_get_cpl,
> +	.get_cpl_no_cache = vmx_get_cpl_no_cache,
>  	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
>  	.is_valid_cr0 = vmx_is_valid_cr0,
>  	.set_cr0 = vmx_set_cr0,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c67e448c6ebd..e2483678eca1 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3568,16 +3568,29 @@ u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
>  	return vmx_read_guest_seg_base(to_vmx(vcpu), seg);
>  }
>  
> -int vmx_get_cpl(struct kvm_vcpu *vcpu)
> +static int __vmx_get_cpl(struct kvm_vcpu *vcpu, bool no_cache)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int ar;
>  
>  	if (unlikely(vmx->rmode.vm86_active))
>  		return 0;
> -	else {
> -		int ar = vmx_read_guest_seg_ar(vmx, VCPU_SREG_SS);
> -		return VMX_AR_DPL(ar);
> -	}
> +
> +	if (no_cache)
> +		ar = vmcs_read32(GUEST_SS_AR_BYTES);
> +	else
> +		ar = vmx_read_guest_seg_ar(vmx, VCPU_SREG_SS);
> +	return VMX_AR_DPL(ar);
> +}
> +
> +int vmx_get_cpl(struct kvm_vcpu *vcpu)
> +{
> +	return __vmx_get_cpl(vcpu, false);
> +}
> +
> +int vmx_get_cpl_no_cache(struct kvm_vcpu *vcpu)
> +{
> +	return __vmx_get_cpl(vcpu, true);
>  }
>  
>  static u32 vmx_segment_access_rights(struct kvm_segment *var)
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 2325f773a20b..bcf40c7f3a38 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -385,6 +385,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
>  void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
>  			unsigned long fs_base, unsigned long gs_base);
>  int vmx_get_cpl(struct kvm_vcpu *vcpu);
> +int vmx_get_cpl_no_cache(struct kvm_vcpu *vcpu);
>  bool vmx_emulation_required(struct kvm_vcpu *vcpu);
>  unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
>  void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 83fe0a78146f..941245082647 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5094,7 +5094,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  	int idx;
>  
>  	if (vcpu->preempted) {
> -		vcpu->arch.preempted_in_kernel = kvm_arch_vcpu_in_kernel(vcpu);
> +		/*
> +		 * Assume protected guests are in-kernel.  Inefficient yielding
> +		 * due to false positives is preferable to never yielding due
> +		 * to false negatives.
> +		 */
> +		vcpu->arch.preempted_in_kernel = vcpu->arch.guest_state_protected ||
> +						 !kvm_x86_call(get_cpl_no_cache)(vcpu);
>  
>  		/*
>  		 * Take the srcu lock as memslots will be accessed to check the gfn
> @@ -13207,6 +13213,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>  
>  bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
>  {
> +	/* TODO: Elide the call in the kvm_guest_get_ip() perf callback. */
>  	if (vcpu->arch.guest_state_protected)
>  		return true;
>  
> @@ -13215,6 +13222,10 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
>  
>  unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
>  {
> +	/* TODO: Elide the call in the kvm_guest_get_ip() perf callback. */
> +	if (vcpu->arch.guest_state_protected)
> +		return 0;
> +
>  	return kvm_rip_read(vcpu);
>  }
>  
> 
> base-commit: 3f8df6285271d9d8f17d733433e5213a63b83a0b

Overall I think I agree with this patch.

Best regards,
      Maxim Levitsky
Sean Christopherson Sept. 25, 2024, 6:30 p.m. UTC | #7
On Wed, Sep 25, 2024, Maxim Levitsky wrote:
> On Wed, 2024-09-25 at 08:08 -0700, Sean Christopherson wrote:
> > > Finally all 3 places that read the segment cache, only access one piece of
> > > data (SS.AR or RIP), thus it doesn't really matter if they see an old or a
> > > new value. 
> > > 
> > > I mean in theory if userspace changes the SS's AR bytes out of the blue, and
> > > then we get a preemption event, in theory as you say the old value is correct
> > > but it really doesn't matter.
> > > 
> > > So IMHO, just ensuring that we invalidate the segment cache right after we do
> > > any changes is the simplest solution.
> > 
> > But it's not a very maintainable solution.  It fixes the immediate problem, but
> > doesn't do anything to help ensure that all future code invalidates the cache
> > after writing,
> 
> If we wrap segment cache access with something like segment_cache_access_begin()/end(),
> we can ensure that segment cache is only modified then (with some macros even maybe),
> or that at least it is clear to the developer that all writes should be wrapped by these
> functions.
> 
> I also do think that we should still re-order the segment cache accesses in vmx_vcpu_reset()
> and other places just for the sake of consistency.

Yeah, I've no objection to doing that, I just don't want that to be the long-term
solution.

> >  nor does it guarantee that all future usage of SS.AR can tolerate
> > consuming stale values.
> > 
> > > I can in addition to that add a warning to kvm_register_is_available and
> > > vmx_segment_cache_test_set, that will test that only SS.AR and RIP are read
> > > from the interrupt context, so that if in the future someone attempts to read
> > > more fields, this issue can be re-evaluated.
> > 
> > There's no need to add anything to vmx_segment_cache_test_set(), because it uses
> > kvm_register_is_available().  I.e. adding logic in kvm_register_is_available()
> > will suffice.
> > 
> > If we explicitly allow VMCS accesses from PMI callbacks, which by we *know* can
> > tolerate stale data _and_ never run while KVM is updating segments, then we can
> > fix the preemption case by forcing a VMREAD and bypassing the cache.
> >  
> > And looking to the future, if vcpu->arch.guest_state_protected is moved/exposed
> > to common code in some way, then the common PMI code can skip trying to read guest
> > state, and the ugliness of open coding that check in the preemption path largely
> > goes away.
> This is assuming that most VMs will be protected in the future?

No, the assumption is that other architectures will have VMs with protected guest
state, e.g. ARM's CCA stuff, at which point handling the "don't bother reading
guest RIP" logic in the common PMI handler is worth doing.

> > If you're ok with the idea, I'll write changelogs and post the below (probably over
> > two patches).  I don't love adding another kvm_x86_ops callback, but I couldn't
> > come up with anything less ugly.
> 
> This was one of the reasons I didn't want to write something like that.
> If we indeed only add callback for get_cpl_no_cache, then it is tolerable.

...

> > diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> > index b1eb46e26b2e..0370483003f6 100644
> > --- a/arch/x86/kvm/kvm_cache_regs.h
> > +++ b/arch/x86/kvm/kvm_cache_regs.h
> > @@ -43,6 +43,18 @@ BUILD_KVM_GPR_ACCESSORS(r14, R14)
> >  BUILD_KVM_GPR_ACCESSORS(r15, R15)
> >  #endif
> >  
> > +/*
> > + * Using the register cache from interrupt context is generally not allowed, as
> > + * caching a register and marking it available/dirty can't be done atomically,
> > + * i.e. accesses from interrupt context may clobber state or read stale data if
> > + * the vCPU task is in the process of updating the cache.  The exception is if
> > + * KVM is handling an IRQ/NMI from guest mode, as that bounded sequence doesn't
> > + * touch the cache, it runs after the cache is reset (post VM-Exit), and PMIs
> > + * need to several registers that are cacheable.
> > + */
> > +#define kvm_assert_register_caching_allowed(vcpu)		\
> > +	lockdep_assert_once(in_task() ||			\
> > +			    READ_ONCE(vcpu->arch.handling_intr_from_guest))
> 
> This is ugly, but on the second thought reasonable, given the circumstances.
> 
> How about using kvm_arch_pmi_in_guest() instead? It is a tiny bit more accurate
> and self-documenting IMHO.

Ah, yeah, good idea.

> Also, how about checking for in_task() in __vmx_get_cpl() and then avoiding the cache?
> This way we will avoid adding a new callback, and in theory if there is more code that
> tries to read CPL from interrupt context, it will work for free. 
> 
> But on the other hand we might actually not want new code to get this for free. 
> Is this the reason you added the callback?

Yeah, exactly.  I actually started coding up using in_task(), but I didn't want
to allow all reads from !in_task() because then it would do the wrong thing for
any usage that isn't tolerant of stale data, i.e. where KVM _should_ read from
the cache.  Even worse, such bugs wouldn't be caught because the in_task() check
would bypass kvm_assert_register_caching_allowed().

I considered guarding the in_task() code with preempt_model_preemptible() so that
non-preemptible kernels would always use the cache, i.e. would detect unexpected
cases.   But that felt like a bit of a hack, and for preempt_dynamic kernels
preempt_model_preemptible() requires a function call, at which point it's entirely
possible that reading from the cache would end up being slower than doing VMREAD.

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fa9f307d9b18..d43bb755e15c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4870,9 +4870,6 @@  void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmx->hv_deadline_tsc = -1;
 	kvm_set_cr8(vcpu, 0);
 
-	vmx_segment_cache_clear(vmx);
-	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
-
 	seg_setup(VCPU_SREG_CS);
 	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
 	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
@@ -4899,6 +4896,9 @@  void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmcs_writel(GUEST_IDTR_BASE, 0);
 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
 
+	vmx_segment_cache_clear(vmx);
+	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
+
 	vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
 	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, 0);