[1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()

Message ID 20250124191109.205955-2-pbonzini@redhat.com (mailing list archive)
State New
Series KVM: x86: Strengthen locking rules for kvm_lock

Commit Message

Paolo Bonzini Jan. 24, 2025, 7:11 p.m. UTC
Protect the whole function with kvm_lock so that all accesses to
nx_hugepage_mitigation_hard_disabled are under the lock, but drop it
when calling out to the MMU to avoid complex circular locking
situations such as the following:

__kvm_set_memory_region()
  lock(&kvm->slots_lock)
                           set_nx_huge_pages()
                             lock(kvm_lock)
                             lock(&kvm->slots_lock)
                                                     __kvmclock_cpufreq_notifier()
                                                       lock(cpu_hotplug_lock)
                                                       lock(kvm_lock)
                                                                                   lock(&kvm->srcu)
                                                                                   kvm_lapic_set_base()
                                                                                     static_branch_inc()
                                                                                       lock(cpu_hotplug_lock)
  sync(&kvm->srcu)

The deadlock is largely theoretical, but it goes as follows:
- __kvm_set_memory_region() waits for kvm->srcu with kvm->slots_lock taken
- set_nx_huge_pages() waits for kvm->slots_lock with kvm_lock taken
- __kvmclock_cpufreq_notifier() waits for kvm_lock with cpu_hotplug_lock taken
- KVM_RUN waits for cpu_hotplug_lock with kvm->srcu taken
- therefore __kvm_set_memory_region() never completes synchronize_srcu(&kvm->srcu).

To break the deadlock, release kvm_lock while taking kvm->slots_lock, which
breaks the chain:

  lock(&kvm->slots_lock)
                           set_nx_huge_pages()
                             lock(kvm_lock)
                                                     __kvmclock_cpufreq_notifier()
                                                       lock(cpu_hotplug_lock)
                                                       lock(kvm_lock)
                                                                                   lock(&kvm->srcu)
                                                                                   kvm_lapic_set_base()
                                                                                     static_branch_inc()
                                                                                       lock(cpu_hotplug_lock)
                             unlock(kvm_lock)
                                                       unlock(kvm_lock)
                                                       unlock(cpu_hotplug_lock)
                                                                                       unlock(cpu_hotplug_lock)
                                                                                   unlock(&kvm->srcu)
                             lock(&kvm->slots_lock)
  sync(&kvm->srcu)
  unlock(&kvm->slots_lock)
                             unlock(&kvm->slots_lock)

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

Comments

Sean Christopherson Jan. 24, 2025, 8:11 p.m. UTC | #1
On Fri, Jan 24, 2025, Paolo Bonzini wrote:
> Protect the whole function with kvm_lock so that all accesses to
> nx_hugepage_mitigation_hard_disabled are under the lock, but drop it
> when calling out to the MMU to avoid complex circular locking
> situations such as the following:

...

> To break the deadlock, release kvm_lock while taking kvm->slots_lock, which
> breaks the chain:

Heh, except it's all kinds of broken.  IMO, biting the bullet and converting to
an SRCU-protected list is going to be far less work in the long run.

> @@ -7143,16 +7141,19 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
>  	if (new_val != old_val) {
>  		struct kvm *kvm;
>  
> -		mutex_lock(&kvm_lock);
> -
>  		list_for_each_entry(kvm, &vm_list, vm_list) {

This is unsafe, as vm_list can be modified while kvm_lock is dropped.  And
using list_for_each_entry_safe() doesn't help, because the _next_ entry may
already have been freed.

> +			kvm_get_kvm(kvm);

This needs to be:

		if (!kvm_get_kvm_safe(kvm))
			continue;

because the last reference to the VM could already have been put.
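
For reference, the two helpers differ roughly like this (simplified; the
real definitions live in virt/kvm/kvm_main.c):

  void kvm_get_kvm(struct kvm *kvm)
  {
    refcount_inc(&kvm->users_count);
  }

  bool kvm_get_kvm_safe(struct kvm *kvm)
  {
    /* Fails (returns false) once users_count has already hit zero. */
    return refcount_inc_not_zero(&kvm->users_count);
  }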

> +			mutex_unlock(&kvm_lock);
> +
>  			mutex_lock(&kvm->slots_lock);
>  			kvm_mmu_zap_all_fast(kvm);
>  			mutex_unlock(&kvm->slots_lock);
>  
>  			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);

See my bug report on this being a NULL pointer deref.

> +
> +			mutex_lock(&kvm_lock);
> +			kvm_put_kvm(kvm);

The order is backwards: kvm_put_kvm() needs to be called before acquiring kvm_lock.
If the last reference is put, kvm_put_kvm() => kvm_destroy_vm() will deadlock on
kvm_lock.
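
(kvm_destroy_vm() itself takes kvm_lock to unlink the VM from vm_list,
roughly:)

  static void kvm_destroy_vm(struct kvm *kvm)
  {
    ...
    mutex_lock(&kvm_lock);
    list_del(&kvm->vm_list);
    mutex_unlock(&kvm_lock);
    ...
  }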

>  		}
> -		mutex_unlock(&kvm_lock);
>  	}
Paolo Bonzini Jan. 24, 2025, 10:19 p.m. UTC | #2
Il ven 24 gen 2025, 21:11 Sean Christopherson <seanjc@google.com> ha scritto:
> Heh, except it's all kinds of broken.

Yes, I didn't even try.

> IMO, biting the bullet and converting to
> an SRCU-protected list is going to be far less work in the long run.

I did try a long SRCU critical section and it was unreviewable. It
ends up a lot less manageable than just making the lock a leaf,
especially as the lock hierarchy spans multiple subsystems (static
key, KVM, cpufreq---thanks CPU hotplug lock...). I also didn't like
adding a synchronization primitive that's... kinda single-use, but
that would not be a blocker of course.

So the second attempt was regular RCU, which looked a lot like this
patch. I started writing all the dances to find a struct kvm that
passes kvm_get_kvm_safe() before you do rcu_read_unlock() and drop the
previous one (because you cannot do kvm_put_kvm() within the RCU read
side) and set aside the idea, incorrectly thinking that they were not
needed with kvm_lock. Plus I didn't like having to keep alive a bunch
of data for a whole grace period if call_rcu() is used.

So for the third attempt I could have chosen between dropping the SRCU
or just using kvm_lock. I didn't even think of SRCU to be honest,
because everything so far looked so bad, but it does seem a little
better than RCU. At least, if kvm_destroy_vm() uses call_srcu(), you
can call kvm_put_kvm() within srcu_read_lock()...srcu_read_unlock().
It would look something like

  idx = srcu_read_lock(&kvm_srcu);
  list_for_each_entry_srcu(kvm, &vm_list, vm_list, 1) {
    if (!kvm_get_kvm_safe(kvm))
      continue;

    /* kvm is protected by the reference count now. */
    srcu_read_unlock(&kvm_srcu, idx);
    ...
    idx = srcu_read_lock(&kvm_srcu);
    /*
     * kvm stays alive, and next can be read, until the next
     * srcu_read_unlock().
     */
    kvm_put_kvm(kvm);
  }
  srcu_read_unlock(&kvm_srcu, idx);
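
The destruction side would be something like this (assuming a global
kvm_srcu and an rcu_head field in struct kvm):

  static void kvm_destroy_vm_cb(struct rcu_head *head)
  {
    struct kvm *kvm = container_of(head, struct kvm, rcu_head);

    /* the actual teardown, i.e. today's kvm_destroy_vm() body */
  }

  /* in kvm_put_kvm(), on the last reference: */
  call_srcu(&kvm_srcu, &kvm->rcu_head, kvm_destroy_vm_cb);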

but again I am not sure how speedy call_srcu() is in reclaiming the
data, even in the common case where set_nx_huge_pages() or any other
RCU reader (none of them is frequent) isn't running. If you don't use
call_srcu() it becomes just as bad as RCU or kvm_lock.

So... let's talk about kvm_lock.

> > @@ -7143,16 +7141,19 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
> > +                     kvm_get_kvm(kvm);
>
> This needs to be:
>
>                 if (!kvm_get_kvm_safe(kvm))
>                         continue;

If we go for kvm_lock, kvm_get_kvm() *can* be made safe within the
critical section; if kvm_put_kvm() uses refcount_dec_and_mutex_lock(),
then the 1->0 transition happens under kvm_lock and cannot race with
kvm_get_kvm() (the mutex can be dropped as soon as
refcount_dec_and_mutex_lock() returns, it's really just the decrement
that needs to be within the critical section).

> >       if (new_val != old_val) {
> >               struct kvm *kvm;
> >
> > -             mutex_lock(&kvm_lock);
> > -
> >               list_for_each_entry(kvm, &vm_list, vm_list) {
>
> This is unsafe, as vm_list can be modified while kvm_lock is dropped.  And
> using list_for_each_entry_safe() doesn't help, because the _next_ entry have been
> freed.

list_for_each_entry_safe() is broken, but list_for_each_entry() can be
used. The problem is the call to kvm_put_kvm(), which is where the kvm
struct is freed, thus breaking the foreach loop. So the next entry must
be read and ref'd _before_ kvm_put_kvm(); then you can do

  kvm_get_kvm(kvm);
  mutex_unlock(&kvm_lock);
  if (prev)
    kvm_put_kvm(prev);
  ...
  mutex_lock(&kvm_lock);
  prev = kvm;
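
Spelled out as a full loop (untested, and assuming kvm_get_kvm() has been
made safe under kvm_lock as above):

  struct kvm *kvm, *prev = NULL;

  mutex_lock(&kvm_lock);
  list_for_each_entry(kvm, &vm_list, vm_list) {
    kvm_get_kvm(kvm);
    mutex_unlock(&kvm_lock);
    if (prev)
      kvm_put_kvm(prev);

    /* ... zap the MMU, wake the recovery thread ... */

    mutex_lock(&kvm_lock);
    prev = kvm;
  }
  mutex_unlock(&kvm_lock);
  if (prev)
    kvm_put_kvm(prev);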

I don't know... there are few-enough readers that SRCU seems a bit
misplaced and it has the issue of keeping the VM data alive; while
kvm_lock has uglier code with the kvm_put_kvm() looking really
misplaced. If there were many instances one could write a nice
iterator, but for just one use?

Hmm... I wonder if something like

  if (poll_state_synchronize_srcu(&kvm_srcu,
          get_state_synchronize_srcu(&kvm_srcu))) {
    kvm_destroy_vm_cb(&kvm->rcu_head);
  } else {
    call_srcu(&kvm_srcu, &kvm->rcu_head, kvm_destroy_vm_cb);
  }

catches the case where there's no concurrent reader. If so, SRCU would
be a winner undoubtedly, but being the only user of a tricky RCU API
doesn't give me warm and fuzzy feelings. I'm still team kvm_lock for
now.

Anyhow I can prepare a tested version next Monday, with either
kvm_lock or with SRCU if the above trick works. Unless I showed that
it's trickier than it seems and successfully nerd-sniped you.
Seriously - just tell me what you prefer.

Paolo
Sean Christopherson Jan. 24, 2025, 11:44 p.m. UTC | #3
On Fri, Jan 24, 2025, Paolo Bonzini wrote:
> Il ven 24 gen 2025, 21:11 Sean Christopherson <seanjc@google.com> ha scritto:
> > Heh, except it's all kinds of broken.
> 
> Yes, I didn't even try.
> 
> > IMO, biting the bullet and converting to
> > an SRCU-protected list is going to be far less work in the long run.
> 
> I did try a long SRCU critical section and it was unreviewable. It
> ends up a lot less manageable than just making the lock a leaf,
> especially as the lock hierarchy spans multiple subsystems (static
> key, KVM, cpufreq---thanks CPU hotplug lock...).

I'm not following.  If __kvmclock_cpufreq_notifier() and set_nx_huge_pages()
switch to SRCU, then the deadlock goes away (it might even go away if just one
of those two switches).

SRCU readers would only interact with kvm_destroy_vm() from a locking perspective,
and if that's problematic then we would already have a plethora of issues.

> I also didn't like adding a synchronization primitive that's... kinda
> single-use, but that would not be a blocker of course.

It would be single use in the sense that it only protects pure readers of
vm_list, but there are plenty of those users.

> So the second attempt was regular RCU, which looked a lot like this
> patch. I started writing all the dances to find a struct kvm that
> passes kvm_get_kvm_safe() before you do rcu_read_unlock() and drop the
> previous one (because you cannot do kvm_put_kvm() within the RCU read
> side) and set aside the idea, incorrectly thinking that they were not
> needed with kvm_lock. Plus I didn't like having to keep alive a bunch
> of data for a whole grace period if call_rcu() is used.
> 
> So for the third attempt I could have chosen between dropping the SRCU
> or just using kvm_lock. I didn't even think of SRCU to be honest,
> because everything so far looked so bad, but it does seem a little
> better than RCU. At least, if kvm_destroy_vm() uses call_srcu(), you
> can call kvm_put_kvm() within srcu_read_lock()...srcu_read_unlock().
> It would look something like
> 
>   list_for_each_entry_srcu(kvm, &vm_list, vm_list, 1) {
>     if (!kvm_get_kvm_safe(kvm))

Unless I'm missing something, we shouldn't need to take a reference so long as
SRCU is synchronized before destroying any part of the VM.  If we don't take a
reference, then we don't need to deal with the complexity of kvm_put_kvm()
creating a recursive lock snafu.

This is what I'm thinking, lightly tested...

---
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 24 Jan 2025 15:15:05 -0800
Subject: [PATCH] KVM: Use an SRCU lock to protect readers of vm_list

Introduce a global SRCU lock to protect KVM's global list of VMs, and use
it in all locations that currently take kvm_lock purely to prevent a VM
from being destroyed.

Keep using kvm_lock for flows that need to prevent VMs from being created,
as SRCU synchronization only guards against use-after-free; it doesn't
ensure a stable vm_list for readers.

This fixes a largely theoretical deadlock where:

  - __kvm_set_memory_region() waits for kvm->srcu with kvm->slots_lock taken
  - set_nx_huge_pages() waits for kvm->slots_lock with kvm_lock taken
  - __kvmclock_cpufreq_notifier() waits for kvm_lock with cpu_hotplug_lock taken
  - KVM_RUN waits for cpu_hotplug_lock with kvm->srcu taken

and therefore __kvm_set_memory_region() never completes
synchronize_srcu(&kvm->srcu).

  __kvm_set_memory_region()
    lock(&kvm->slots_lock)
                           set_nx_huge_pages()
                             lock(kvm_lock)
                             lock(&kvm->slots_lock)
                                                     __kvmclock_cpufreq_notifier()
                                                       lock(cpu_hotplug_lock)
                                                       lock(kvm_lock)
                                                                                   lock(&kvm->srcu)
                                                                                   kvm_lapic_set_base()
                                                                                     static_branch_inc()
                                                                                       lock(cpu_hotplug_lock)
  sync(&kvm->srcu)

Opportunistically add macros to walk the list of VMs, and the array of
vCPUs in each VM, to cut down on the amount of boilerplate.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c   | 19 +++++++++++------
 arch/x86/kvm/x86.c       | 36 +++++++++++++++++--------------
 include/linux/kvm_host.h |  9 ++++++++
 virt/kvm/kvm_main.c      | 46 +++++++++++++++++++++++++---------------
 4 files changed, 71 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 74fa38ebddbf..f5b7ceb7ca0e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7127,6 +7127,10 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 	} else if (sysfs_streq(val, "never")) {
 		new_val = 0;
 
+		/*
+		 * Take kvm_lock to ensure no VMs are *created* before the flag
+		 * is set.  vm_list_srcu only protects against VMs being deleted.
+		 */
 		mutex_lock(&kvm_lock);
 		if (!list_empty(&vm_list)) {
 			mutex_unlock(&kvm_lock);
@@ -7142,17 +7146,19 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 
 	if (new_val != old_val) {
 		struct kvm *kvm;
+		int idx;
 
-		mutex_lock(&kvm_lock);
+		idx = srcu_read_lock(&vm_list_srcu);
 
-		list_for_each_entry(kvm, &vm_list, vm_list) {
+		kvm_for_each_vm_srcu(kvm) {
 			mutex_lock(&kvm->slots_lock);
 			kvm_mmu_zap_all_fast(kvm);
 			mutex_unlock(&kvm->slots_lock);
 
 			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
 		}
-		mutex_unlock(&kvm_lock);
+
+		srcu_read_unlock(&vm_list_srcu, idx);
 	}
 
 	return 0;
@@ -7275,13 +7281,14 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 	if (is_recovery_enabled &&
 	    (!was_recovery_enabled || old_period > new_period)) {
 		struct kvm *kvm;
+		int idx;
 
-		mutex_lock(&kvm_lock);
+		idx = srcu_read_lock(&vm_list_srcu);
 
-		list_for_each_entry(kvm, &vm_list, vm_list)
+		kvm_for_each_vm_srcu(kvm)
 			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
 
-		mutex_unlock(&kvm_lock);
+		srcu_read_unlock(&vm_list_srcu, idx);
 	}
 
 	return err;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2d9a16fd4d3..8fb49237d179 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9428,6 +9428,11 @@ static void kvm_hyperv_tsc_notifier(void)
 	struct kvm *kvm;
 	int cpu;
 
+	/*
+	 * Take kvm_lock, not just vm_list_srcu, to prevent new VMs from coming
+	 * along in the middle of the update and not getting the in-progress
+	 * request.
+	 */
 	mutex_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list)
 		kvm_make_mclock_inprogress_request(kvm);
@@ -9456,7 +9461,7 @@ static void __kvmclock_cpufreq_notifier(struct cpufreq_freqs *freq, int cpu)
 {
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int send_ipi = 0;
+	int send_ipi = 0, idx;
 	unsigned long i;
 
 	/*
@@ -9500,17 +9505,16 @@ static void __kvmclock_cpufreq_notifier(struct cpufreq_freqs *freq, int cpu)
 
 	smp_call_function_single(cpu, tsc_khz_changed, freq, 1);
 
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (vcpu->cpu != cpu)
-				continue;
-			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-			if (vcpu->cpu != raw_smp_processor_id())
-				send_ipi = 1;
-		}
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vcpu_in_each_vm(kvm, vcpu, i) {
+		if (vcpu->cpu != cpu)
+			continue;
+
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		if (vcpu->cpu != raw_smp_processor_id())
+			send_ipi = 1;
 	}
-	mutex_unlock(&kvm_lock);
+	srcu_read_unlock(&vm_list_srcu, idx);
 
 	if (freq->old < freq->new && send_ipi) {
 		/*
@@ -9588,13 +9592,13 @@ static void pvclock_gtod_update_fn(struct work_struct *work)
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
 	unsigned long i;
+	int idx;
 
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
-			kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vcpu_in_each_vm(kvm, vcpu, i)
+		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+	srcu_read_unlock(&vm_list_srcu, idx);
 	atomic_set(&kvm_guest_has_master_clock, 0);
-	mutex_unlock(&kvm_lock);
 }
 
 static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9df590e8f3da..0d0edb697160 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -193,6 +193,11 @@ bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
 
 extern struct mutex kvm_lock;
 extern struct list_head vm_list;
+extern struct srcu_struct vm_list_srcu;
+
+#define kvm_for_each_vm_srcu(__kvm)				\
+	list_for_each_entry_srcu(__kvm, &vm_list, vm_list,	\
+				 srcu_read_lock_held(&vm_list_srcu))
 
 struct kvm_io_range {
 	gpa_t addr;
@@ -1001,6 +1006,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu_by_id(struct kvm *kvm, int id)
 	return NULL;
 }
 
+#define kvm_for_each_vcpu_in_each_vm(__kvm, __vcpu, __i)		\
+	kvm_for_each_vm_srcu(__kvm)					\
+		kvm_for_each_vcpu(__i, __vcpu, __kvm)
+
 void kvm_destroy_vcpus(struct kvm *kvm);
 
 void vcpu_load(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e0b9d6dd6a85..7fcc4433bf35 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -109,6 +109,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
 
 DEFINE_MUTEX(kvm_lock);
 LIST_HEAD(vm_list);
+DEFINE_SRCU(vm_list_srcu);
 
 static struct kmem_cache *kvm_vcpu_cache;
 
@@ -1261,13 +1262,21 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	int i;
 	struct mm_struct *mm = kvm->mm;
 
+	mutex_lock(&kvm_lock);
+	list_del(&kvm->vm_list);
+	mutex_unlock(&kvm_lock);
+
+	/*
+	 * Ensure all readers of the global list go away before destroying any
+	 * aspect of the VM.  After this, the VM object is reachable only via
+	 * this task and notifiers that are registered to the VM itself.
+	 */
+	synchronize_srcu(&vm_list_srcu);
+
 	kvm_destroy_pm_notifier(kvm);
 	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
 	kvm_destroy_vm_debugfs(kvm);
 	kvm_arch_sync_events(kvm);
-	mutex_lock(&kvm_lock);
-	list_del(&kvm->vm_list);
-	mutex_unlock(&kvm_lock);
 	kvm_arch_pre_destroy_vm(kvm);
 
 	kvm_free_irq_routing(kvm);
@@ -6096,14 +6105,16 @@ static int vm_stat_get(void *_offset, u64 *val)
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
 	u64 tmp_val;
+	int idx;
 
 	*val = 0;
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list) {
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vm_srcu(kvm) {
 		kvm_get_stat_per_vm(kvm, offset, &tmp_val);
 		*val += tmp_val;
 	}
-	mutex_unlock(&kvm_lock);
+	srcu_read_unlock(&vm_list_srcu, idx);
+
 	return 0;
 }
 
@@ -6111,15 +6122,15 @@ static int vm_stat_clear(void *_offset, u64 val)
 {
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
+	int idx;
 
 	if (val)
 		return -EINVAL;
 
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list) {
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vm_srcu(kvm)
 		kvm_clear_stat_per_vm(kvm, offset);
-	}
-	mutex_unlock(&kvm_lock);
+	srcu_read_unlock(&vm_list_srcu, idx);
 
 	return 0;
 }
@@ -6132,14 +6143,15 @@ static int vcpu_stat_get(void *_offset, u64 *val)
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
 	u64 tmp_val;
+	int idx;
 
 	*val = 0;
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list) {
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vm_srcu(kvm) {
 		kvm_get_stat_per_vcpu(kvm, offset, &tmp_val);
 		*val += tmp_val;
 	}
-	mutex_unlock(&kvm_lock);
+	srcu_read_unlock(&vm_list_srcu, idx);
 	return 0;
 }
 
@@ -6147,15 +6159,15 @@ static int vcpu_stat_clear(void *_offset, u64 val)
 {
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
+	int idx;
 
 	if (val)
 		return -EINVAL;
 
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list) {
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vm_srcu(kvm)
 		kvm_clear_stat_per_vcpu(kvm, offset);
-	}
-	mutex_unlock(&kvm_lock);
+	srcu_read_unlock(&vm_list_srcu, idx);
 
 	return 0;
 }

base-commit: eb723766b1030a23c38adf2348b7c3d1409d11f0
--
Paolo Bonzini Jan. 25, 2025, 12:08 a.m. UTC | #4
On 1/25/25 00:44, Sean Christopherson wrote:
>> I did try a long SRCU critical section and it was unreviewable. It
>> ends up a lot less manageable than just making the lock a leaf,
>> especially as the lock hierarchy spans multiple subsystems (static
>> key, KVM, cpufreq---thanks CPU hotplug lock...).
> 
> I'm not following.  If __kvmclock_cpufreq_notifier() and set_nx_huge_pages()
> switch to SRCU, then the deadlock goes away (it might even go away if just one
> of those two switches). [...] It would be single use in the sense that it
> only protects pure readers of vm_list, but there are plenty of those users.

Yes, single use in the sense that only set_nx_huge_pages() really needs 
it.  kvm_lock doesn't produce any noticeable contention and as you noted 
sometimes you really need it even in pure readers.

> SRCU readers would only interact with kvm_destroy_vm() from a locking perspective,
> and if that's problematic then we would already have a plethora of issues.

Ah yeah, I missed that you cannot hold any lock when calling 
kvm_put_kvm().  So the waiting side is indeed a leaf and cannot block 
someone else.

Still from your patch (thanks!) I don't really like the special cases on 
taking SRCU vs. kvm_lock... It really seems like a job for a mutex or 
rwsem.  It keeps the complexity in the one place that is different (i.e. 
where a lock is taken inside the iteration) and everything else can just 
iterate normally.

Paolo
Sean Christopherson Jan. 25, 2025, 12:44 a.m. UTC | #5
On Sat, Jan 25, 2025, Paolo Bonzini wrote:
> On 1/25/25 00:44, Sean Christopherson wrote:
> > SRCU readers would only interact with kvm_destroy_vm() from a locking perspective,
> > and if that's problematic then we would already have a plethora of issues.
> 
> Ah yeah, I missed that you cannot hold any lock when calling kvm_put_kvm().
> So the waiting side is indeed a leaf and cannot block someone else.
> 
> Still from your patch (thanks!) I don't really like the special cases on
> taking SRCU vs. kvm_lock... It really seems like a job for a mutex or rwsem.
> It keeps the complexity in the one place that is different (i.e. where a
> lock is taken inside the iteration) and everything else can just iterate
> normally.

I like the special casing, it makes the oddballs stand out, which in turn (hopefully)
makes developers pause and take note.  I.e. the SRCU walkers are all normal readers,
the set_nx_huge_pages() "never" path is a write in disguise, and
kvm_hyperv_tsc_notifier() is a very special snowflake.
Paolo Bonzini Jan. 27, 2025, 5:27 p.m. UTC | #6
On Sat, Jan 25, 2025 at 1:44 AM Sean Christopherson <seanjc@google.com> wrote:
> I like the special casing, it makes the oddballs stand out, which in turn (hopefully)
> makes developers pause and take note.  I.e. the SRCU walkers are all normal readers,
> the set_nx_huge_pages() "never" path is a write in disguise, and
> kvm_hyperv_tsc_notifier() is a very special snowflake.

set_nx_huge_pages() is not a writer in disguise.  Rather, it's
a *real* writer for nx_hugepage_mitigation_hard_disabled which is
also protected by kvm_lock; and there's a (mostly theoretical)
bug in set_nx_huge_pages_recovery_param() which reads it without
taking the lock.  But it's still a reader as far as vm_list is
concerned.

Likewise, kvm_hyperv_tsc_notifier()'s requirement does deserve a comment,
but its specialness is self-inflicted pain due to using (S)RCU even when
it's not the most appropriate synchronization mechanism.

Paolo
Paolo Bonzini Jan. 27, 2025, 5:56 p.m. UTC | #7
On Mon, Jan 27, 2025 at 6:27 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> On Sat, Jan 25, 2025 at 1:44 AM Sean Christopherson <seanjc@google.com> wrote:
> > I like the special casing, it makes the oddballs stand out, which in turn (hopefully)
> > makes developers pause and take note.  I.e. the SRCU walkers are all normal readers,
> > the set_nx_huge_pages() "never" path is a write in disguise, and
> > kvm_hyperv_tsc_notifier() is a very special snowflake.
>
> Likewise, kvm_hyperv_tsc_notifier()'s requirement does deserve a comment,
> but its specialness is self-inflicted pain due to using (S)RCU even when
> it's not the most appropriate synchronization mechanism.

... in fact, you could have a KVM_CREATE_VCPU and KVM_RUN after this
point:

        mutex_lock(&kvm_lock);
        list_for_each_entry(kvm, &vm_list, vm_list)
                kvm_make_mclock_inprogress_request(kvm);

because kvm_lock is not enough to ensure that all vCPUs got the
KVM_REQ_MCLOCK_INPROGRESS memo.  So kvm_hyperv_tsc_notifier()'s
complications go beyond kvm_lock and the choice to use SRCU or not.

Paolo
Sean Christopherson Jan. 27, 2025, 6:01 p.m. UTC | #8
On Mon, Jan 27, 2025, Paolo Bonzini wrote:
> On Sat, Jan 25, 2025 at 1:44 AM Sean Christopherson <seanjc@google.com> wrote:
> > I like the special casing, it makes the oddballs stand out, which in turn (hopefully)
> > makes developers pause and take note.  I.e. the SRCU walkers are all normal readers,
> > the set_nx_huge_pages() "never" path is a write in disguise, and
> > kvm_hyperv_tsc_notifier() is a very special snowflake.
> 
> set_nx_huge_pages() is not a writer in disguise.  Rather, it's
> a *real* writer for nx_hugepage_mitigation_hard_disabled which is
> also protected by kvm_lock;

Heh, agreed, I was trying to say that it's a write that is disguised as a reader.

> and there's a (mostly theoretical) bug in set_nx_huge_pages_recovery_param()
> which reads it without taking the lock.

It's arguably not a bug.  Userspace has no visibility into the order in which
param writes are processed.  If there are racing writes to the period/ratio and
"never", both outcomes are legal (rejected with -EPERM or period/ratio changes).
If nx_hugepage_mitigation_hard_disabled becomes set after the params are changed,
then vm_list is guaranteed to be empty, so the wakeup walk is still a nop.
Paolo Bonzini Jan. 27, 2025, 6:17 p.m. UTC | #9
On Mon, Jan 27, 2025 at 7:01 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jan 27, 2025, Paolo Bonzini wrote:
> > On Sat, Jan 25, 2025 at 1:44 AM Sean Christopherson <seanjc@google.com> wrote:
> > > I like the special casing, it makes the oddballs stand out, which in turn (hopefully)
> > > makes developers pause and take note.  I.e. the SRCU walkers are all normal readers,
> > > the set_nx_huge_pages() "never" path is a write in disguise, and
> > > kvm_hyperv_tsc_notifier() is a very special snowflake.
> >
> > set_nx_huge_pages() is not a writer in disguise.  Rather, it's
> > a *real* writer for nx_hugepage_mitigation_hard_disabled which is
> > also protected by kvm_lock;
>
> Heh, agreed, I was trying to say that it's a write that is disguised as a reader.
>
> > and there's a (mostly theoretical) bug in set_nx_huge_pages_recovery_param()
> > which reads it without taking the lock.
>
> It's arguably not a bug.  Userspace has no visibility into the order in which
> param writes are processed.  If there are racing writes to the period/ratio and
> "never", both outcomes are legal (rejected with -EPERM or period/ratio changes).
> If nx_hugepage_mitigation_hard_disabled becomes set after the params are changed,
> then vm_list is guaranteed to be empty, so the wakeup walk is still a nop.

I think we have enough arguably necessary lockless cases to think about
linearizability of sysfs writes and reads of vm_list. :)  If you agree,
set_nx_huge_pages() and set_nx_huge_pages_recovery_param() can be
changed to simply have a guard(mutex)(&kvm_lock) around them, instead
of protecting just the vm_list walk.

Paolo

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2401606db260..1d8b45e7bb94 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7114,6 +7114,7 @@  static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 	bool old_val = nx_huge_pages;
 	bool new_val;
 
+	guard(mutex)(&kvm_lock);
 	if (nx_hugepage_mitigation_hard_disabled)
 		return -EPERM;
 
@@ -7127,13 +7128,10 @@  static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 	} else if (sysfs_streq(val, "never")) {
 		new_val = 0;
 
-		mutex_lock(&kvm_lock);
 		if (!list_empty(&vm_list)) {
-			mutex_unlock(&kvm_lock);
 			return -EBUSY;
 		}
 		nx_hugepage_mitigation_hard_disabled = true;
-		mutex_unlock(&kvm_lock);
 	} else if (kstrtobool(val, &new_val) < 0) {
 		return -EINVAL;
 	}
@@ -7143,16 +7141,19 @@  static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 	if (new_val != old_val) {
 		struct kvm *kvm;
 
-		mutex_lock(&kvm_lock);
-
 		list_for_each_entry(kvm, &vm_list, vm_list) {
+			kvm_get_kvm(kvm);
+			mutex_unlock(&kvm_lock);
+
 			mutex_lock(&kvm->slots_lock);
 			kvm_mmu_zap_all_fast(kvm);
 			mutex_unlock(&kvm->slots_lock);
 
 			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
+
+			mutex_lock(&kvm_lock);
+			kvm_put_kvm(kvm);
 		}
-		mutex_unlock(&kvm_lock);
 	}
 
 	return 0;