
kvm: defer huge page recovery vhost task to later

Message ID: 20250123153543.2769928-1-kbusch@meta.com
State: New
Series: kvm: defer huge page recovery vhost task to later

Commit Message

Keith Busch Jan. 23, 2025, 3:35 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

Some libraries want to ensure they are single threaded before forking,
so making the kernel's kvm huge page recovery process a vhost task of
the user process breaks those. The minijail library used by crosvm is
one such affected application.

Defer the task to after the first VM_RUN call, which occurs after the
parent process has forked all its jailed processes. This needs to happen
only once for the kvm instance, so this patch introduces infrastructure
to do that (Suggested-by Paolo).

Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 arch/x86/include/asm/kvm_call_once.h | 44 ++++++++++++++++++++++++++++
 arch/x86/include/asm/kvm_host.h      |  2 ++
 arch/x86/kvm/mmu/mmu.c               | 18 ++++++++----
 arch/x86/kvm/x86.c                   |  7 ++++-
 4 files changed, 65 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/include/asm/kvm_call_once.h
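For readers skimming ahead: the new infrastructure is a small once-per-object primitive. A minimal usage sketch (hypothetical names; in this series the object is struct kvm_arch and the callback spawns the recovery task):

struct my_obj {
	struct once setup_once;
	bool ready;
};

/* Runs at most once per object, serialized by the once's mutex. */
static void my_obj_expensive_setup(struct once *once)
{
	struct my_obj *obj = container_of(once, struct my_obj, setup_once);

	/* e.g. spawn a worker task, allocate caches, ... */
	obj->ready = true;
}

static void my_obj_init(struct my_obj *obj)
{
	once_init(&obj->setup_once);	/* cheap, done at creation time */
}

static void my_obj_hot_path(struct my_obj *obj)
{
	/*
	 * The first caller invokes the callback under the mutex; later
	 * callers return via the acquire-load fast path.
	 */
	call_once(&obj->setup_once, my_obj_expensive_setup);
}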

Comments

Paolo Bonzini Jan. 24, 2025, 3:28 p.m. UTC | #1
> Defer the task to after the first VM_RUN call, which occurs after the
> parent process has forked all its jailed processes. This needs to happen
> only once for the kvm instance, so this patch introduces infrastructure
> to do that (Suggested-by Paolo).

Queued for 6.14; in the end I moved the new data structure to include/linux,
since it is generally usable and not limited to KVM.

>  int kvm_arch_post_init_vm(struct kvm *kvm)
>  {
> -	return kvm_mmu_post_init_vm(kvm);
> +	once_init(&kvm->arch.nx_once);
> +	return 0;
>  }

This could have been in kvm_arch_init_vm(), but then the last user of
kvm_arch_post_init_vm() goes away and more cleanup is in order.  I'll
post the obvious patch shortly.

Thanks Keith and Alyssa.

Paolo
Keith Busch Jan. 24, 2025, 4:48 p.m. UTC | #2
On Fri, Jan 24, 2025 at 10:28:03AM -0500, Paolo Bonzini wrote:
> > Defer the task to after the first VM_RUN call, which occurs after the
> > parent process has forked all its jailed processes. This needs to happen
> > only once for the kvm instance, so this patch introduces infrastructure
> > to do that (Suggested-by Paolo).
> 
> Queued for 6.14; in the end I moved the new data structure to include/linux,
> since it is generally usable and not limited to KVM.

Thanks! I see that you also added the "Fixes" tag that I forgot to
append in the most recent version, so thank you for that.
 
> >  int kvm_arch_post_init_vm(struct kvm *kvm)
> >  {
> > -	return kvm_mmu_post_init_vm(kvm);
> > +	once_init(&kvm->arch.nx_once);
> > +	return 0;
> >  }
> 
> This could have been in kvm_arch_init_vm(), but then the last user of
> kvm_arch_post_init_vm() goes away and more cleanup is in order.  I'll
> post the obvious patch shortly.

Yes, that makes sense. I had a similar cleanup in the first version
too.
Sean Christopherson Jan. 24, 2025, 8:07 p.m. UTC | #3
On Thu, Jan 23, 2025, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Some libraries want to ensure they are single threaded before forking,
> so making the kernel's kvm huge page recovery process a vhost task of
> the user process breaks those. The minijail library used by crosvm is
> one such affected application.
> 
> Defer the task to after the first VM_RUN call, which occurs after the
> parent process has forked all its jailed processes. This needs to happen
> only once for the kvm instance, so this patch introduces infrastructure
> to do that (Suggested-by Paolo).
> 
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Tested-by: Alyssa Ross <hi@alyssa.is>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 26b4ba7e7cb5e..a45ae60e84ab4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7447,20 +7447,28 @@ static bool kvm_nx_huge_page_recovery_worker(void *data)
>  	return true;
>  }
>  
> -int kvm_mmu_post_init_vm(struct kvm *kvm)
> +static void kvm_mmu_start_lpage_recovery(struct once *once)
>  {
> -	if (nx_hugepage_mitigation_hard_disabled)
> -		return 0;
> +	struct kvm_arch *ka = container_of(once, struct kvm_arch, nx_once);
> +	struct kvm *kvm = container_of(ka, struct kvm, arch);
>  
>  	kvm->arch.nx_huge_page_last = get_jiffies_64();
>  	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
>  		kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill,
>  		kvm, "kvm-nx-lpage-recovery");
>  
> +	if (kvm->arch.nx_huge_page_recovery_thread)
> +		vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
> +}
> +
> +int kvm_mmu_post_init_vm(struct kvm *kvm)
> +{
> +	if (nx_hugepage_mitigation_hard_disabled)
> +		return 0;
> +
> +	call_once(&kvm->arch.nx_once, kvm_mmu_start_lpage_recovery);
>  	if (!kvm->arch.nx_huge_page_recovery_thread)
>  		return -ENOMEM;
> -
> -	vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
>  	return 0;
>  }
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6e248152fa134..6d4a6734b2d69 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11471,6 +11471,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  	struct kvm_run *kvm_run = vcpu->run;
>  	int r;
>  
> +	r = kvm_mmu_post_init_vm(vcpu->kvm);
> +	if (r)
> +		return r;

This is broken.  If the module param is toggled before the first KVM_RUN, KVM
will hit a NULL pointer deref due to trying to start a non-existent vhost task:

  BUG: kernel NULL pointer dereference, address: 0000000000000040
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0 
  Oops: Oops: 0000 [#1] SMP
  CPU: 16 UID: 0 PID: 1190 Comm: bash Not tainted 6.13.0-rc3-9bb02e874121-x86/xen_msr_fixes-vm #2382
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:vhost_task_wake+0x5/0x10
  Call Trace:
   <TASK>
   set_nx_huge_pages+0xcc/0x1e0 [kvm]
   param_attr_store+0x8a/0xd0
   module_attr_store+0x1a/0x30
   kernfs_fop_write_iter+0x12f/0x1e0
   vfs_write+0x233/0x3e0
   ksys_write+0x60/0xd0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7f3b52710104
   </TASK>
  Modules linked in: kvm_intel kvm
  CR2: 0000000000000040
  ---[ end trace 0000000000000000 ]---
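For context, the parameter-store path in that trace looks roughly like this (simplified from upstream arch/x86/kvm/mmu/mmu.c; value parsing, the kvm_mmu_zap_all_fast() call, and error handling are elided). With task creation now deferred to the first KVM_RUN, the thread pointer can legitimately still be NULL here:

static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
{
	struct kvm *kvm;

	/* ... parse and publish the new nx_huge_pages value ... */

	mutex_lock(&kvm_lock);
	list_for_each_entry(kvm, &vm_list, vm_list) {
		/*
		 * Oops here: nx_huge_page_recovery_thread is NULL for any
		 * VM that has not yet entered KVM_RUN.
		 */
		vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
	}
	mutex_unlock(&kvm_lock);

	return 0;
}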
Keith Busch Jan. 24, 2025, 8:54 p.m. UTC | #4
On Fri, Jan 24, 2025 at 12:07:24PM -0800, Sean Christopherson wrote:
> This is broken.  If the module param is toggled before the first KVM_RUN, KVM
> will hit a NULL pointer deref due to trying to start a non-existent vhost task:
> 
>   BUG: kernel NULL pointer dereference, address: 0000000000000040
>   #PF: supervisor read access in kernel mode
>   #PF: error_code(0x0000) - not-present page
>   PGD 0 P4D 0 
>   Oops: Oops: 0000 [#1] SMP
>   CPU: 16 UID: 0 PID: 1190 Comm: bash Not tainted 6.13.0-rc3-9bb02e874121-x86/xen_msr_fixes-vm #2382
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:vhost_task_wake+0x5/0x10
>   Call Trace:
>    <TASK>
>    set_nx_huge_pages+0xcc/0x1e0 [kvm]

Thanks for pointing out this gap. It looks like we'd have to hold the
kvm_lock in kvm_mmu_post_init_vm(), and add NULL checks in
set_nx_huge_pages() and set_nx_huge_pages_recovery_param() to prevent
the NULL deref. Is that okay?
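The NULL-check half of that suggestion would amount to something like the sketch below (hypothetical helper name; as the following replies discuss, a bare check is not obviously sufficient on its own):

static void kvm_wake_nx_recovery_thread(struct kvm *kvm)
{
	/* Skip VMs whose recovery task has not been created yet. */
	if (kvm->arch.nx_huge_page_recovery_thread)
		vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
}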
Sean Christopherson Jan. 25, 2025, 12:10 a.m. UTC | #5
On Fri, Jan 24, 2025, Keith Busch wrote:
> On Fri, Jan 24, 2025 at 12:07:24PM -0800, Sean Christopherson wrote:
> > This is broken.  If the module param is toggled before the first KVM_RUN, KVM
> > will hit a NULL pointer deref due to trying to start a non-existent vhost task:
> > 
> >   BUG: kernel NULL pointer dereference, address: 0000000000000040
> >   #PF: supervisor read access in kernel mode
> >   #PF: error_code(0x0000) - not-present page
> >   PGD 0 P4D 0 
> >   Oops: Oops: 0000 [#1] SMP
> >   CPU: 16 UID: 0 PID: 1190 Comm: bash Not tainted 6.13.0-rc3-9bb02e874121-x86/xen_msr_fixes-vm #2382
> >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >   RIP: 0010:vhost_task_wake+0x5/0x10
> >   Call Trace:
> >    <TASK>
> >    set_nx_huge_pages+0xcc/0x1e0 [kvm]
> 
> Thanks for pointing out this gap. It looks like we'd have to hold the
> kvm_lock in kvm_mmu_post_init_vm(), and add NULL checks in
> set_nx_huge_pages() and set_nx_huge_pages_recovery_param() to prevent
> the NULL deref. Is that okay?

I don't _think_ we need to take kvm_lock.  And I don't want to take kvm_lock,
because we're also trying to eliminate a (very theoretical) deadlock[1] due to
taking kvm_lock in the params helpers.

There is a race that can happen with my proposed fix[2], but I'm not sure we care
enough to address it.  If kvm_nx_huge_page_recovery_worker() runs before the params
are set, and the param setter processes the VM before nx_huge_page_recovery_thread
is set, then the worker could sleep for too long, relative to what userspace expects.

I suppose if we care then we could fix that by taking kvm->arch.nx_once.mutex
when waking the task?

[1] https://lore.kernel.org/all/20250124191109.205955-2-pbonzini@redhat.com
[2] https://lore.kernel.org/all/20250124234623.3609069-1-seanjc@google.com
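Taking the once's mutex in such a wake helper would look roughly like this (speculative, per the question above; note the mutex member is named "lock" in the header as posted):

static void kvm_wake_nx_recovery_thread(struct kvm *kvm)
{
	/* Serialize against call_once() creating and publishing the task. */
	guard(mutex)(&kvm->arch.nx_once.lock);

	if (kvm->arch.nx_huge_page_recovery_thread)
		vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
}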
Keith Busch Jan. 25, 2025, 4:05 a.m. UTC | #6
On Fri, Jan 24, 2025 at 04:10:45PM -0800, Sean Christopherson wrote:
> On Fri, Jan 24, 2025, Keith Busch wrote:
> > On Fri, Jan 24, 2025 at 12:07:24PM -0800, Sean Christopherson wrote:
> > > This is broken.  If the module param is toggled before the first KVM_RUN, KVM
> > > will hit a NULL pointer deref due to trying to start a non-existent vhost task:
> > > 
> > >   BUG: kernel NULL pointer dereference, address: 0000000000000040
> > >   #PF: supervisor read access in kernel mode
> > >   #PF: error_code(0x0000) - not-present page
> > >   PGD 0 P4D 0 
> > >   Oops: Oops: 0000 [#1] SMP
> > >   CPU: 16 UID: 0 PID: 1190 Comm: bash Not tainted 6.13.0-rc3-9bb02e874121-x86/xen_msr_fixes-vm #2382
> > >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > >   RIP: 0010:vhost_task_wake+0x5/0x10
> > >   Call Trace:
> > >    <TASK>
> > >    set_nx_huge_pages+0xcc/0x1e0 [kvm]
> > 
> > Thanks for pointing out this gap. It looks like we'd have to hold the
> > kvm_lock in kvm_mmu_post_init_vm(), and add NULL checks in
> > set_nx_huge_pages() and set_nx_huge_pages_recovery_param() to prevent
> > the NULL deref. Is that okay?
> 
> I don't _think_ we need to take kvm_lock.  And I don't want to take kvm_lock,
> because we're also trying to eliminate a (very theoretical) deadlock[1] due to
> taking kvm_lock in the params helpers.
> 
> There is a race that can happen with my proposed fix[2], but I'm not sure we care
> enough to address it.  If kvm_nx_huge_page_recovery_worker() runs before the params
> are set, and the param setter processes the VM before nx_huge_page_recovery_thread
> is set, then the worker could sleep for too long, relative to what userspace expects.
> 
> I suppose if we care then we could fix that by taking kvm->arch.nx_once.mutex
> when waking the task?

I think we actually can do this without any additional locks. The only
thing we need to ensure is that the vhost task sees the updated
variable, and I think we can achieve that with appropriate memory
barriers on the reads and writes.
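Sketched out, that lockless approach would publish the pointer only once the task is fully started, with a matching one-time load on the wake side (close in spirit to the proposed fix at [2]; the exact ordering primitives may differ):

static void kvm_mmu_start_lpage_recovery(struct once *once)
{
	struct kvm_arch *ka = container_of(once, struct kvm_arch, nx_once);
	struct kvm *kvm = container_of(ka, struct kvm, arch);
	struct vhost_task *nx_thread;

	kvm->arch.nx_huge_page_last = get_jiffies_64();
	nx_thread = vhost_task_create(kvm_nx_huge_page_recovery_worker,
				      kvm_nx_huge_page_recovery_worker_kill,
				      kvm, "kvm-nx-lpage-recovery");
	if (!nx_thread)
		return;

	vhost_task_start(nx_thread);

	/* Publish only after the task exists and is running. */
	WRITE_ONCE(kvm->arch.nx_huge_page_recovery_thread, nx_thread);
}

static void kvm_wake_nx_recovery_thread(struct kvm *kvm)
{
	/* Pairs with the WRITE_ONCE() above; NULL means "never entered KVM_RUN". */
	struct vhost_task *nx_thread =
		READ_ONCE(kvm->arch.nx_huge_page_recovery_thread);

	if (nx_thread)
		vhost_task_wake(nx_thread);
}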

Patch

diff --git a/arch/x86/include/asm/kvm_call_once.h b/arch/x86/include/asm/kvm_call_once.h
new file mode 100644
index 0000000000000..451cc87084aa7
--- /dev/null
+++ b/arch/x86/include/asm/kvm_call_once.h
@@ -0,0 +1,44 @@ 
+#ifndef _LINUX_CALL_ONCE_H
+#define _LINUX_CALL_ONCE_H
+
+#include <linux/types.h>
+
+#define ONCE_NOT_STARTED 0
+#define ONCE_RUNNING     1
+#define ONCE_COMPLETED   2
+
+struct once {
+        atomic_t state;
+        struct mutex lock;
+};
+
+static inline void __once_init(struct once *once, const char *name,
+			       struct lock_class_key *key)
+{
+        atomic_set(&once->state, ONCE_NOT_STARTED);
+        __mutex_init(&once->lock, name, key);
+}
+
+#define once_init(once)							\
+do {									\
+	static struct lock_class_key __key;				\
+	__once_init((once), #once, &__key);				\
+} while (0)
+
+static inline void call_once(struct once *once, void (*cb)(struct once *))
+{
+        /* Pairs with atomic_set_release() below.  */
+        if (atomic_read_acquire(&once->state) == ONCE_COMPLETED)
+                return;
+
+        guard(mutex)(&once->lock);
+        WARN_ON(atomic_read(&once->state) == ONCE_RUNNING);
+        if (atomic_read(&once->state) != ONCE_NOT_STARTED)
+                return;
+
+        atomic_set(&once->state, ONCE_RUNNING);
+        cb(once);
+        atomic_set_release(&once->state, ONCE_COMPLETED);
+}
+
+#endif /* _LINUX_CALL_ONCE_H */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2f442701dc755..e1eb8155e6a82 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -37,6 +37,7 @@ 
 #include <asm/kvm_page_track.h>
 #include <asm/kvm_vcpu_regs.h>
 #include <asm/hyperv-tlfs.h>
+#include <asm/kvm_call_once.h>
 #include <asm/reboot.h>
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
@@ -1466,6 +1467,7 @@  struct kvm_arch {
 	struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter;
 	struct vhost_task *nx_huge_page_recovery_thread;
 	u64 nx_huge_page_last;
+	struct once nx_once;
 
 #ifdef CONFIG_X86_64
 	/* The number of TDP MMU pages across all roots. */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 26b4ba7e7cb5e..a45ae60e84ab4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7447,20 +7447,28 @@  static bool kvm_nx_huge_page_recovery_worker(void *data)
 	return true;
 }
 
-int kvm_mmu_post_init_vm(struct kvm *kvm)
+static void kvm_mmu_start_lpage_recovery(struct once *once)
 {
-	if (nx_hugepage_mitigation_hard_disabled)
-		return 0;
+	struct kvm_arch *ka = container_of(once, struct kvm_arch, nx_once);
+	struct kvm *kvm = container_of(ka, struct kvm, arch);
 
 	kvm->arch.nx_huge_page_last = get_jiffies_64();
 	kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
 		kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill,
 		kvm, "kvm-nx-lpage-recovery");
 
+	if (kvm->arch.nx_huge_page_recovery_thread)
+		vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
+}
+
+int kvm_mmu_post_init_vm(struct kvm *kvm)
+{
+	if (nx_hugepage_mitigation_hard_disabled)
+		return 0;
+
+	call_once(&kvm->arch.nx_once, kvm_mmu_start_lpage_recovery);
 	if (!kvm->arch.nx_huge_page_recovery_thread)
 		return -ENOMEM;
-
-	vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
 	return 0;
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e248152fa134..6d4a6734b2d69 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11471,6 +11471,10 @@  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	struct kvm_run *kvm_run = vcpu->run;
 	int r;
 
+	r = kvm_mmu_post_init_vm(vcpu->kvm);
+	if (r)
+		return r;
+
 	vcpu_load(vcpu);
 	kvm_sigset_activate(vcpu);
 	kvm_run->flags = 0;
@@ -12748,7 +12752,8 @@  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 int kvm_arch_post_init_vm(struct kvm *kvm)
 {
-	return kvm_mmu_post_init_vm(kvm);
+	once_init(&kvm->arch.nx_once);
+	return 0;
 }
 
 static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)