
[v2,2/2] KVM: VMX: disable preemption when touching segment fields

Message ID 20240716022014.240960-3-mlevitsk@redhat.com (mailing list archive)
State New, archived
Series Fix for a very old KVM bug in the segment cache

Commit Message

Maxim Levitsky July 16, 2024, 2:20 a.m. UTC
VMX code uses a segment cache to avoid reading guest segment fields
from the VMCS.

The cache is invalidated each time a segment field (e.g. base, access
rights, etc.) is about to be written, and only then is the new value
of the field written to the VMCS.

However, if the vCPU is preempted between these two events and the
segment field is read (e.g. KVM reads SS's access rights to check
whether the vCPU is in kernel mode), the old field value gets cached
and is never updated.

Usually a lock would be required to avoid such a race, but since a
vCPU's segments are only accessed by its own vCPU thread, it is enough
to disable preemption in the places where the segment cache is
invalidated and the segment fields are updated.
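
To make the race window concrete, below is a minimal standalone model
of it (toy C with made-up names; only the lazy-caching pattern matches
the real code):

/* cache_race.c - toy model of the stale-cache window (not KVM code) */
#include <stdio.h>

#define SEG_FIELD_AR 0x1u              /* one "field is cached" bit */

static unsigned int vmcs_ar = 0x93;    /* stands in for the AR field in the VMCS */

struct seg_cache {
	unsigned int valid_bitmask;    /* which fields currently hold a cached value */
	unsigned int cached_ar;        /* cached copy of the AR field */
};

/* Lazy read: cache the field on first access, as vmx_read_guest_seg_ar() does. */
static unsigned int read_ar(struct seg_cache *c)
{
	if (!(c->valid_bitmask & SEG_FIELD_AR)) {
		c->cached_ar = vmcs_ar;
		c->valid_bitmask |= SEG_FIELD_AR;
	}
	return c->cached_ar;
}

int main(void)
{
	struct seg_cache cache = { 0, 0 };

	/* Writer: invalidate the cache, intending to write a new value next. */
	cache.valid_bitmask = 0;

	/*
	 * If the vCPU thread is preempted here and, e.g., kvm_arch_vcpu_put()
	 * reads the field, the *old* value gets cached again:
	 */
	printf("read in the window:   0x%x\n", read_ar(&cache));

	vmcs_ar = 0x9b;                /* only now does the new value land */

	/* Subsequent reads keep returning the stale cached value. */
	printf("read after the write: 0x%x\n", read_ar(&cache));
	return 0;
}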

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/nested.c |  4 +++-
 arch/x86/kvm/vmx/vmx.c    | 25 +++++++++++++++++++------
 arch/x86/kvm/vmx/vmx.h    | 14 +++++++++++++-
 3 files changed, 35 insertions(+), 8 deletions(-)

Comments

Sean Christopherson July 16, 2024, 10:36 p.m. UTC | #1
On Mon, Jul 15, 2024, Maxim Levitsky wrote:
> VMX code uses segment cache to avoid reading guest segment fields.
> 
> The cache is reset each time a segment's field (e.g base/access rights/etc)
> is written, and then a new value of this field is written.
> 
> However if the vCPU is preempted between these two events, and this
> segment field is read (e.g kvm reads SS's access rights to check
> if the vCPU is in kernel mode), then old field value will get
> cached and never updated.

It'd be super helpful to include the gory details about how kvm_arch_vcpu_put()
reads stale data.  Without that information, it's very hard to figure out how
getting preempted is problematic.

  vmx_vcpu_reset() resets the segment cache bitmask and then initializes
  the segments in the VMCS. However, if the vCPU is preempted in the
  middle of this code, kvm_arch_vcpu_put() is called, which reads SS's
  AR bytes to determine whether the vCPU is in kernel mode, and that
  caches the old value.

> Usually a lock is required to avoid such race but since vCPU segments
> are only accessed by its vCPU thread, we can avoid a lock and
> only disable preemption, in places where the segment cache
> is invalidated and segment fields are updated.

This doesn't fully fix the problem.  It's not just kvm_sched_out() => kvm_arch_vcpu_put()
that's problematic, it's any path that executes KVM code in interrupt context.
And it's not just limited to segment registers, any register that is conditionally
cached via arch.regs_avail is susceptible to races.

Specifically, kvm_guest_state() and kvm_guest_get_ip() will read SS.AR_bytes and
RIP in NMI and/or IRQ context when handling a PMI.

A few possible ideas.

 1. Force reads from IRQ/NMI context to skip the cache and go to the VMCS.

 2. Same thing as #1, but focus it specifically on kvm_arch_vcpu_in_kernel()
    and kvm_arch_vcpu_get_ip(), and WARN if kvm_register_is_available() or
    vmx_segment_cache_test_set() is invoked from IRQ or NMI context.

 3. Force caching of SS.AR_bytes, CS.AR_bytes, and RIP prior to kvm_after_interrupt(),
    rename preempted_in_kernel to something like "exited_in_kernel" and snapshot
    it before kvm_after_interrupt(), and add the same hardening as #2.

    This is doable because kvm_guest_state() should never read guest state for
    PMIs that occur between VM-Exit and kvm_after_interrupt(), nor should KVM
    write guest state in that window.  And the intent of the "preempted in kernel"
    check is to query vCPU state at the time of exit.

 5. Do a combination of #3 and patch 02 (#3 fixes PMIs, patch 02 fixes preemption).

My vote is probably for #2 or #4.  I definitely think we need WARNs in the caching
code, and in general kvm_arch_vcpu_put() shouldn't be reading cacheable state, i.e.
I am fairly confident we can restrict it to checking CPL.

I don't hate this patch by any means, but I don't love disabling preemption in a
bunch of flows just so that the preempted_in_kernel logic works.
Maxim Levitsky July 25, 2024, 12:59 p.m. UTC | #2
On Tue, 2024-07-16 at 15:36 -0700, Sean Christopherson wrote:
> On Mon, Jul 15, 2024, Maxim Levitsky wrote:
> > VMX code uses segment cache to avoid reading guest segment fields.
> > 
> > The cache is reset each time a segment's field (e.g base/access rights/etc)
> > is written, and then a new value of this field is written.
> > 
> > However if the vCPU is preempted between these two events, and this
> > segment field is read (e.g kvm reads SS's access rights to check
> > if the vCPU is in kernel mode), then old field value will get
> > cached and never updated.
> 
> It'be super helpful to include the gory details about how kvm_arch_vcpu_put()
> reads stale data.  Without that information, it's very hard to figure out how
> getting preempted is problematic.

I will do this in next version of this patch.

> 
>   vmx_vcpu_reset resets the segment cache bitmask and then initializes
>   the segments in the vmcs, however if the vcpus is preempted in the
>   middle of this code, the kvm_arch_vcpu_put is called which
>   reads SS's AR bytes to determine if the vCPU is in the kernel mode,
>   which caches the old value.
> 
> > Usually a lock is required to avoid such race but since vCPU segments
> > are only accessed by its vCPU thread, we can avoid a lock and
> > only disable preemption, in places where the segment cache
> > is invalidated and segment fields are updated.
> 
> This doesn't fully fix the problem.  It's not just kvm_sched_out() => kvm_arch_vcpu_put()
> that's problematic, it's any path that executes KVM code in interrupt context.
> And it's not just limited to segment registers, any register that is conditionally
> cached via arch.regs_avail is susceptible to races.
> 
> Specifically, kvm_guest_state() and kvm_guest_get_ip() will read SS.AR_bytes and
> RIP in NMI and/or IRQ context when handling a PMI.

> 
> A few possible ideas.
> 
>  1. Force reads from IRQ/NMI context to skip the cache and go to the VMCS.

This IMHO is the best solution. For the segment cache it's easy to do; the
code will be contained in the vmx_read_guest_seg_* functions.

For other VMX registers, this can be a lot of work due to the way the code
is scattered around. Still probably doable.
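
For the AR bytes accessor, that could look roughly like the sketch below
(the helper body is approximated from the existing code and untested,
shown only to illustrate the idea):

static u32 vmx_read_guest_seg_ar(struct vcpu_vmx *vmx, unsigned int seg)
{
	u32 *p = &vmx->segment_cache.seg[seg].ar;

	/*
	 * Readers in IRQ/NMI context bypass the cache and go straight to
	 * the VMCS, so they can never cache a stale value behind the
	 * vCPU thread's back.
	 */
	if (in_nmi() || in_hardirq())
		return vmcs_read32(kvm_vmx_segment_fields[seg].ar_bytes);

	if (!vmx_segment_cache_test_set(vmx, seg, SEG_FIELD_AR))
		*p = vmcs_read32(kvm_vmx_segment_fields[seg].ar_bytes);
	return *p;
}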


> 
>  2. Same thing as #1, but focus it specifically on kvm_arch_vcpu_in_kernel()
>     and kvm_arch_vcpu_get_ip(), and WARN if kvm_register_is_available() or
>     vmx_segment_cache_test_set() is invoked from IRQ or NMI context.

I agree with this; it's actually one of the suggestions I had originally
(I didn't notice kvm_arch_vcpu_get_ip, though).

I think I will implement this suggestion.
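
For reference, a rough sketch of the WARN part of #2 (the body below is
approximated from the existing helper; only the WARN is new, and its
exact placement is an assumption):

static bool vmx_segment_cache_test_set(struct vcpu_vmx *vmx, unsigned int seg,
				       unsigned int field)
{
	bool ret;
	u32 mask = 1 << (seg * SEG_FIELD_NR + field);

	/*
	 * Populating the cache from IRQ/NMI context can race with the vCPU
	 * thread's invalidate-then-write sequence, so warn about it.
	 */
	WARN_ON_ONCE(in_nmi() || in_hardirq());

	if (!kvm_register_is_available(&vmx->vcpu, VCPU_EXREG_SEGMENTS)) {
		kvm_register_mark_available(&vmx->vcpu, VCPU_EXREG_SEGMENTS);
		vmx->segment_cache.bitmask = 0;
	}
	ret = vmx->segment_cache.bitmask & mask;
	vmx->segment_cache.bitmask |= mask;
	return ret;
}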

> 
>  3. Force caching of SS.AR_bytes, CS.AR_bytes, and RIP prior to kvm_after_interrupt(),
>     rename preempted_in_kernel to something like "exited_in_kernel" and snapshot
>     it before kvm_after_interrupt(), and add the same hardening as #2.
> 
>     This is doable because kvm_guest_state() should never read guest state for
>     PMIs that occur between VM-Exit and kvm_after_interrupt(), nor should KVM
>     write guest state in that window.  And the intent of the "preempted in kernel"
>     check is to query vCPU state at the time of exit.
> 


>  5. Do a combination of #3 and patch 02 (#3 fixes PMIs, patch 02 fixes preemption).


> 
> My vote is probably for #2 or #4.
#4 causes a NULL pointer dereference here :)

>   I definitely think we need WARNs in the caching
> code, and in general kvm_arch_vcpu_put() shouldn't be reading cacheable state, i.e.
> I am fairly confident we can restrict it to checking CPL.
> 
> I don't hate this patch by any means, but I don't love disabling preemption in a
> bunch of flows just so that the preempted_in_kernel logic works.
> 


Thanks for the suggestions!

Best regards,	
	Maxim Levitsky
Maxim Levitsky July 25, 2024, 5:37 p.m. UTC | #3
On Thu, 2024-07-25 at 08:59 -0400, Maxim Levitsky wrote:
> On Tue, 2024-07-16 at 15:36 -0700, Sean Christopherson wrote:
> > On Mon, Jul 15, 2024, Maxim Levitsky wrote:
> > > VMX code uses segment cache to avoid reading guest segment fields.
> > > 
> > > The cache is reset each time a segment's field (e.g base/access rights/etc)
> > > is written, and then a new value of this field is written.
> > > 
> > > However if the vCPU is preempted between these two events, and this
> > > segment field is read (e.g kvm reads SS's access rights to check
> > > if the vCPU is in kernel mode), then old field value will get
> > > cached and never updated.
> > 
> > It'be super helpful to include the gory details about how kvm_arch_vcpu_put()
> > reads stale data.  Without that information, it's very hard to figure out how
> > getting preempted is problematic.
> 
> I will do this in next version of this patch.
> 
> >   vmx_vcpu_reset resets the segment cache bitmask and then initializes
> >   the segments in the vmcs, however if the vcpus is preempted in the
> >   middle of this code, the kvm_arch_vcpu_put is called which
> >   reads SS's AR bytes to determine if the vCPU is in the kernel mode,
> >   which caches the old value.
> > 
> > > Usually a lock is required to avoid such race but since vCPU segments
> > > are only accessed by its vCPU thread, we can avoid a lock and
> > > only disable preemption, in places where the segment cache
> > > is invalidated and segment fields are updated.
> > 
> > This doesn't fully fix the problem.  It's not just kvm_sched_out() => kvm_arch_vcpu_put()
> > that's problematic, it's any path that executes KVM code in interrupt context.
> > And it's not just limited to segment registers, any register that is conditionally
> > cached via arch.regs_avail is susceptible to races.
> > 
> > Specifically, kvm_guest_state() and kvm_guest_get_ip() will read SS.AR_bytes and
> > RIP in NMI and/or IRQ context when handling a PMI.
> > A few possible ideas.
> > 
> >  1. Force reads from IRQ/NMI context to skip the cache and go to the VMCS.
> 
> This IMHO is the best solution. For segment cache its easy to do, the code
> will be contained in vmx_read_guest_seg_* functions.
> 
> For other VMX registers, this can be lot of work due to the way the code is scattered
> around. Still probably double.
> 
> 
> >  2. Same thing as #1, but focus it specifically on kvm_arch_vcpu_in_kernel()
> >     and kvm_arch_vcpu_get_ip(), and WARN if kvm_register_is_available() or
> >     vmx_segment_cache_test_set() is invoked from IRQ or NMI context.
> 
> I agree on this, this is actually one of the suggestions I had originally.
> ( I didn't notice the kvm_arch_vcpu_get_ip though )
> 
> I think I will implement this suggestion.
> 
> >  3. Force caching of SS.AR_bytes, CS.AR_bytes, and RIP prior to kvm_after_interrupt(),
> >     rename preempted_in_kernel to something like "exited_in_kernel" and snapshot
> >     it before kvm_after_interrupt(), and add the same hardening as #2.
> > 
> >     This is doable because kvm_guest_state() should never read guest state for
> >     PMIs that occur between VM-Exit and kvm_after_interrupt(), nor should KVM
> >     write guest state in that window.  And the intent of the "preempted in kernel"
> >     check is to query vCPU state at the time of exit.
> > 
> >  5. Do a combination of #3 and patch 02 (#3 fixes PMIs, patch 02 fixes preemption).
> > My vote is probably for #2 or #4.
> #4 causes a NULL pointer deference here :) 
> 
> >   I definitely think we need WARNs in the caching
> > code, and in general kvm_arch_vcpu_put() shouldn't be reading cacheable state, i.e.
> > I am fairly confident we can restrict it to checking CPL.
> > 
> > I don't hate this patch by any means, but I don't love disabling preemption in a
> > bunch of flows just so that the preempted_in_kernel logic works.
> > 
> 
> Thanks for the suggestions!

Hi,

I decided to keep it simple. I'll send a patch that moves the call to
vmx_segment_cache_clear() to after we are done with the segment
initialization in vmx_vcpu_reset(), and later I'll write a
refactoring/hardening patch to make sure that we don't read the cache
from interrupt context.
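
Roughly, the shape of that change (sketch only, not the actual patch;
the surrounding vmx_vcpu_reset() code is abbreviated):

	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);

	seg_setup(VCPU_SREG_CS);
	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);

	/* ... remaining segment and GDTR/IDTR setup ... */

	vmcs_writel(GUEST_IDTR_BASE, 0);
	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);

	/*
	 * Moved from before the writes: clearing last means a read from the
	 * preemption path can no longer re-cache a value that is about to
	 * be overwritten.
	 */
	vmx_segment_cache_clear(vmx);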

Best regards,
	Maxim Levitsky




Patch

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d3ca1a772ae67..b6597fe5d011d 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2470,7 +2470,7 @@  static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 	if (!hv_evmcs || !(hv_evmcs->hv_clean_fields &
 			   HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2)) {
 
-		vmx_segment_cache_clear(vmx);
+		vmx_write_segment_cache_start(vmx);
 
 		vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
 		vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
@@ -2508,6 +2508,8 @@  static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 		vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
 		vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
 		vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
+
+		vmx_write_segment_cache_end(vmx);
 	}
 
 	if (!hv_evmcs || !(hv_evmcs->hv_clean_fields &
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fa9f307d9b18b..26a5efd34aef7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2171,12 +2171,14 @@  int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 #ifdef CONFIG_X86_64
 	case MSR_FS_BASE:
-		vmx_segment_cache_clear(vmx);
+		vmx_write_segment_cache_start(vmx);
 		vmcs_writel(GUEST_FS_BASE, data);
+		vmx_write_segment_cache_end(vmx);
 		break;
 	case MSR_GS_BASE:
-		vmx_segment_cache_clear(vmx);
+		vmx_write_segment_cache_start(vmx);
 		vmcs_writel(GUEST_GS_BASE, data);
+		vmx_write_segment_cache_end(vmx);
 		break;
 	case MSR_KERNEL_GS_BASE:
 		vmx_write_guest_kernel_gs_base(vmx, data);
@@ -3088,7 +3090,7 @@  static void enter_rmode(struct kvm_vcpu *vcpu)
 
 	vmx->rmode.vm86_active = 1;
 
-	vmx_segment_cache_clear(vmx);
+	vmx_write_segment_cache_start(vmx);
 
 	vmcs_writel(GUEST_TR_BASE, kvm_vmx->tss_addr);
 	vmcs_write32(GUEST_TR_LIMIT, RMODE_TSS_SIZE - 1);
@@ -3109,6 +3111,8 @@  static void enter_rmode(struct kvm_vcpu *vcpu)
 	fix_rmode_seg(VCPU_SREG_DS, &vmx->rmode.segs[VCPU_SREG_DS]);
 	fix_rmode_seg(VCPU_SREG_GS, &vmx->rmode.segs[VCPU_SREG_GS]);
 	fix_rmode_seg(VCPU_SREG_FS, &vmx->rmode.segs[VCPU_SREG_FS]);
+
+	vmx_write_segment_cache_end(vmx);
 }
 
 int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer)
@@ -3139,8 +3143,9 @@  int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer)
 static void enter_lmode(struct kvm_vcpu *vcpu)
 {
 	u32 guest_tr_ar;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
-	vmx_segment_cache_clear(to_vmx(vcpu));
+	vmx_write_segment_cache_start(vmx);
 
 	guest_tr_ar = vmcs_read32(GUEST_TR_AR_BYTES);
 	if ((guest_tr_ar & VMX_AR_TYPE_MASK) != VMX_AR_TYPE_BUSY_64_TSS) {
@@ -3150,6 +3155,9 @@  static void enter_lmode(struct kvm_vcpu *vcpu)
 			     (guest_tr_ar & ~VMX_AR_TYPE_MASK)
 			     | VMX_AR_TYPE_BUSY_64_TSS);
 	}
+
+	vmx_write_segment_cache_end(vmx);
+
 	vmx_set_efer(vcpu, vcpu->arch.efer | EFER_LMA);
 }
 
@@ -3571,7 +3579,7 @@  void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	const struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
 
-	vmx_segment_cache_clear(vmx);
+	vmx_write_segment_cache_start(vmx);
 
 	if (vmx->rmode.vm86_active && seg != VCPU_SREG_LDTR) {
 		vmx->rmode.segs[seg] = *var;
@@ -3601,6 +3609,8 @@  void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 		var->type |= 0x1; /* Accessed */
 
 	vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
+
+	vmx_write_segment_cache_end(vmx);
 }
 
 void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
@@ -4870,7 +4880,8 @@  void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmx->hv_deadline_tsc = -1;
 	kvm_set_cr8(vcpu, 0);
 
-	vmx_segment_cache_clear(vmx);
+	vmx_write_segment_cache_start(vmx);
+
 	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
 
 	seg_setup(VCPU_SREG_CS);
@@ -4899,6 +4910,8 @@  void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmcs_writel(GUEST_IDTR_BASE, 0);
 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
 
+	vmx_write_segment_cache_end(vmx);
+
 	vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
 	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, 0);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 1689f0d59f435..cba14911032cd 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -755,9 +755,21 @@  static inline bool vmx_can_use_ipiv(struct kvm_vcpu *vcpu)
 	return  lapic_in_kernel(vcpu) && enable_ipiv;
 }
 
-static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
+static inline void vmx_write_segment_cache_start(struct vcpu_vmx *vmx)
+{
+	/*
+	 * The VMX segment cache can be accessed during preemption, e.g. to
+	 * determine the guest's CPL, which would cache a stale value.
+	 *
+	 * To avoid that, disable preemption until the new values are written.
+	 */
+	preempt_disable();
+}
+
+static inline void vmx_write_segment_cache_end(struct vcpu_vmx *vmx)
 {
 	vmx->segment_cache.bitmask = 0;
+	preempt_enable();
 }
 
 #endif /* __KVM_X86_VMX_H */