
[0/2] Fix for a very old KVM bug in the segment cache

Message ID: 20240713013856.1568501-1-mlevitsk@redhat.com

Message

Maxim Levitsky July 13, 2024, 1:38 a.m. UTC
Hi,

Recently, while trying to understand why the pmu_counters_test
selftest sometimes fails when run nested, I stumbled
upon an interesting and very old bug:

It turns out that KVM caches guest segment state,
but this cache doesn't have any protection against concurrent use.

This usually works because the cache is per-vCPU and should
only be accessed by the vCPU thread; however, there is an exception:

If full preemption is enabled in the host kernel,
the vCPU thread can be preempted, for
example in the middle of vmx_vcpu_reset.

vmx_vcpu_reset resets the segment cache bitmask and then initializes
the segments in the vmcs. However, if the vCPU is preempted in the
middle of this code, kvm_arch_vcpu_put is called, which
reads SS's AR bytes to determine whether the vCPU is in kernel mode,
and thereby caches the old value.

Later, vmx_vcpu_reset sets SS's AR field in the vmcs to the correct
value, but the cache still contains the stale value, which
can later leak, for example via KVM_GET_SREGS and such.

In particular, the kvm selftests do KVM_GET_SREGS
followed by KVM_SET_SREGS, passing the broken SS AR field as-is,
which leads to a VM entry failure.

This is not a nested-virtualization issue; I was actually able
to reproduce it on bare metal, but due to timing it happens
much more often when nested. The only requirement for it to happen
is that full preemption is enabled in the kernel that runs the selftest.

pmu_counters_test reproduces this issue well because it creates
lots of short-lived VMs, but as noted
above, the issue is not related to the PMU.

To fix this issue, I wrapped the places that write the segment
fields with preempt_disable/enable. It's not an ideal fix; other options
are possible. Please tell me if you prefer one of these:

1. Getting rid of the segment cache. I am not sure how much it helps
these days - this code is very old.

2. Using a read/write lock - IMHO the cleanest solution but might
also affect performance.

3. Making the kvm_arch_vcpu_in_kernel not touch the cache
and instead do a vmread directly.
This is a shorter solution but probably less future proof.

Best regards,
	Maxim Levitsky

Maxim Levitsky (2):
  KVM: nVMX: use vmx_segment_cache_clear
  KVM: VMX: disable preemption when writing guest segment state

 arch/x86/kvm/vmx/nested.c |  7 ++++++-
 arch/x86/kvm/vmx/vmx.c    | 22 ++++++++++++++++++----
 arch/x86/kvm/vmx/vmx.h    |  5 +++++
 3 files changed, 29 insertions(+), 5 deletions(-)

Comments

Paolo Bonzini July 13, 2024, 10:22 a.m. UTC | #1
On 7/13/24 03:38, Maxim Levitsky wrote:
> 1. Getting rid of the segment cache. I am not sure how much it helps
> these days - this code is very old.
> 
> 2. Using a read/write lock - IMHO the cleanest solution but might
> also affect performance.

A read/write lock would cause a deadlock between the writer and the 
sched_out callback, since they run on the same CPU.

I think the root cause of the issue is that clearing the cache should be 
done _after_ the writes (and should have a barrier() at the beginning, 
if only for cleanliness).  So your patch 1 should leave the clearing of 
vmx->segment_cache.bitmask where it was.

However, that would still leave an assumption: that it's okay that a 
sched_out during vmx_vcpu_reset() (or other functions that write segment 
data in the VMCS) accesses stale data, as long as the stale data is not 
used after vmx_vcpu_reset() returns.  Your patch is a safer approach, 
but maybe wrap preempt_disable()/preempt_enable() with

	vmx_invalidate_segment_cache_start() {
		preempt_disable();
	}
	vmx_invalidate_segment_cache_end() {
		vmx->segment_cache.bitmask = 0;
		preempt_enable();
	}

Paolo
Maxim Levitsky July 16, 2024, 2:21 a.m. UTC | #2
On Sat, 2024-07-13 at 12:22 +0200, Paolo Bonzini wrote:
> On 7/13/24 03:38, Maxim Levitsky wrote:
> > 1. Getting rid of the segment cache. I am not sure how much it helps
> > these days - this code is very old.
> > 
> > 2. Using a read/write lock - IMHO the cleanest solution but might
> > also affect performance.
> 
> A read/write lock would cause a deadlock between the writer and the 
> sched_out callback, since they run on the same CPU.
> 
> I think the root cause of the issue is that clearing the cache should be 
> done _after_ the writes (and should have a barrier() at the beginning, 
> if only for cleanliness).  So your patch 1 should leave the clearing of 
> vmx->segment_cache.bitmask where it was.
> 
> However, that would still leave an assumption: that it's okay that a 
> sched_out during vmx_vcpu_reset() (or other functions that write segment 
> data in the VMCS) accesses stale data, as long as the stale data is not 
> used after vmx_vcpu_reset() returns.  Your patch is a safer approach, 
> but maybe wrap preempt_disable()/preempt_enable() with
> 
> 	vmx_invalidate_segment_cache_start() {
> 		preempt_disable();
> 	}
> 	vmx_invalidate_segment_cache_end() {
> 		vmx->segment_cache.bitmask = 0;
> 		preempt_enable();
> 	}
> 
> Paolo
> 

Hi Paolo!

This looks like a very good idea, I'll do this in v2.

Thanks,
Best regards,
	Maxim Levitsky