Message ID | 20250123153543.2769928-1-kbusch@meta.com (mailing list archive)
---|---
State | New |
Series | kvm: defer huge page recovery vhost task to later

> Defer the task to after the first VM_RUN call, which occurs after the
> parent process has forked all its jailed processes. This needs to happen
> only once for the kvm instance, so this patch introduces infrastructure
> to do that (Suggested-by Paolo).

Queued for 6.13; in the end I moved the new data structure to
include/linux, since it is generally usable and not limited to KVM.

>  int kvm_arch_post_init_vm(struct kvm *kvm)
>  {
> -        return kvm_mmu_post_init_vm(kvm);
> +        once_init(&kvm->arch.nx_once);
> +        return 0;
>  }

This could have been in kvm_arch_init_vm(), but then the last user of
kvm_arch_post_init_vm() goes away and more cleanup is in order. I'll
post the obvious patch shortly.

Thanks Keith and Alyssa.

Paolo
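
For context, the once API described above is small enough to reuse
outside KVM. A hypothetical non-KVM user might look roughly like this
(the foo_dev names are illustrative, not from the series):

#include <linux/call_once.h>    /* header location after the move */
#include <linux/kthread.h>

/* Hypothetical device that must spawn its worker thread exactly once. */
struct foo_dev {
        struct once worker_once;
        struct task_struct *worker;
};

static int foo_worker_fn(void *data)
{
        /* ... worker loop, elided ... */
        return 0;
}

static void foo_start_worker(struct once *once)
{
        struct foo_dev *dev = container_of(once, struct foo_dev, worker_once);

        dev->worker = kthread_run(foo_worker_fn, dev, "foo-worker");
}

static void foo_init(struct foo_dev *dev)
{
        once_init(&dev->worker_once);
}

static void foo_fast_path(struct foo_dev *dev)
{
        /* Safe from any number of racing sleepable contexts:
         * foo_start_worker() runs at most once per device. */
        call_once(&dev->worker_once, foo_start_worker);
}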

On Fri, Jan 24, 2025 at 10:28:03AM -0500, Paolo Bonzini wrote:
> > Defer the task to after the first VM_RUN call, which occurs after the
> > parent process has forked all its jailed processes. This needs to happen
> > only once for the kvm instance, so this patch introduces infrastructure
> > to do that (Suggested-by Paolo).
>
> Queued for 6.13; in the end I moved the new data structure to include/linux,
> since it is generally usable and not limited to KVM.

Thanks! I see that you also added the "Fixes" tag that I forgot to
append in the most recent version, so thank you for that.

> >  int kvm_arch_post_init_vm(struct kvm *kvm)
> >  {
> > -        return kvm_mmu_post_init_vm(kvm);
> > +        once_init(&kvm->arch.nx_once);
> > +        return 0;
> >  }
>
> This could have been in kvm_arch_init_vm(), but then the last user of
> kvm_arch_post_init_vm() goes away and more cleanup is in order. I'll
> post the obvious patch shortly.

Yes, that makes sense. I had a similar cleanup in the first version too.

On Thu, Jan 23, 2025, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Some libraries want to ensure they are single threaded before forking,
> so making the kernel's kvm huge page recovery process a vhost task of
> the user process breaks those. The minijail library used by crosvm is
> one such affected application.
>
> Defer the task to after the first VM_RUN call, which occurs after the
> parent process has forked all its jailed processes. This needs to happen
> only once for the kvm instance, so this patch introduces infrastructure
> to do that (Suggested-by Paolo).
>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Tested-by: Alyssa Ross <hi@alyssa.is>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 26b4ba7e7cb5e..a45ae60e84ab4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7447,20 +7447,28 @@ static bool kvm_nx_huge_page_recovery_worker(void *data)
>          return true;
>  }
>
> -int kvm_mmu_post_init_vm(struct kvm *kvm)
> +static void kvm_mmu_start_lpage_recovery(struct once *once)
>  {
> -        if (nx_hugepage_mitigation_hard_disabled)
> -                return 0;
> +        struct kvm_arch *ka = container_of(once, struct kvm_arch, nx_once);
> +        struct kvm *kvm = container_of(ka, struct kvm, arch);
>
>          kvm->arch.nx_huge_page_last = get_jiffies_64();
>          kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
>                  kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill,
>                  kvm, "kvm-nx-lpage-recovery");
>
> +        if (kvm->arch.nx_huge_page_recovery_thread)
> +                vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
> +}
> +
> +int kvm_mmu_post_init_vm(struct kvm *kvm)
> +{
> +        if (nx_hugepage_mitigation_hard_disabled)
> +                return 0;
> +
> +        call_once(&kvm->arch.nx_once, kvm_mmu_start_lpage_recovery);
>          if (!kvm->arch.nx_huge_page_recovery_thread)
>                  return -ENOMEM;
> -
> -        vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
>          return 0;
>  }
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6e248152fa134..6d4a6734b2d69 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11471,6 +11471,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>          struct kvm_run *kvm_run = vcpu->run;
>          int r;
>
> +        r = kvm_mmu_post_init_vm(vcpu->kvm);
> +        if (r)
> +                return r;

This is broken. If the module param is toggled before the first KVM_RUN, KVM
will hit a NULL pointer deref due to trying to start a non-existent vhost task:

  BUG: kernel NULL pointer dereference, address: 0000000000000040
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] SMP
  CPU: 16 UID: 0 PID: 1190 Comm: bash Not tainted 6.13.0-rc3-9bb02e874121-x86/xen_msr_fixes-vm #2382
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:vhost_task_wake+0x5/0x10
  Call Trace:
   <TASK>
   set_nx_huge_pages+0xcc/0x1e0 [kvm]
   param_attr_store+0x8a/0xd0
   module_attr_store+0x1a/0x30
   kernfs_fop_write_iter+0x12f/0x1e0
   vfs_write+0x233/0x3e0
   ksys_write+0x60/0xd0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7f3b52710104
   </TASK>
  Modules linked in: kvm_intel kvm
  CR2: 0000000000000040
  ---[ end trace 0000000000000000 ]---
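
For reference, the trigger here is just a module-param write from the
host before any vCPU has entered KVM_RUN, along the lines of:

  echo 0 > /sys/module/kvm/parameters/nx_huge_pages

Any value change makes set_nx_huge_pages() walk vm_list and wake each
VM's recovery task, which is still NULL for a VM that has not run yet.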

On Fri, Jan 24, 2025 at 12:07:24PM -0800, Sean Christopherson wrote:
> This is broken. If the module param is toggled before the first KVM_RUN, KVM
> will hit a NULL pointer deref due to trying to start a non-existent vhost task:
>
>   BUG: kernel NULL pointer dereference, address: 0000000000000040
>   #PF: supervisor read access in kernel mode
>   #PF: error_code(0x0000) - not-present page
>   PGD 0 P4D 0
>   Oops: Oops: 0000 [#1] SMP
>   CPU: 16 UID: 0 PID: 1190 Comm: bash Not tainted 6.13.0-rc3-9bb02e874121-x86/xen_msr_fixes-vm #2382
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:vhost_task_wake+0x5/0x10
>   Call Trace:
>    <TASK>
>    set_nx_huge_pages+0xcc/0x1e0 [kvm]

Thanks for pointing out this gap. It looks like we'd have to hold the
kvm_lock in kvm_mmu_post_init_vm(), and add NULL checks in
set_nx_huge_pages() and set_nx_huge_pages_recovery_param() to prevent
the NULL deref. Is that okay?

On Fri, Jan 24, 2025, Keith Busch wrote:
> On Fri, Jan 24, 2025 at 12:07:24PM -0800, Sean Christopherson wrote:
> > This is broken. If the module param is toggled before the first KVM_RUN, KVM
> > will hit a NULL pointer deref due to trying to start a non-existent vhost task:
> >
> >   BUG: kernel NULL pointer dereference, address: 0000000000000040
> >   #PF: supervisor read access in kernel mode
> >   #PF: error_code(0x0000) - not-present page
> >   PGD 0 P4D 0
> >   Oops: Oops: 0000 [#1] SMP
> >   CPU: 16 UID: 0 PID: 1190 Comm: bash Not tainted 6.13.0-rc3-9bb02e874121-x86/xen_msr_fixes-vm #2382
> >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >   RIP: 0010:vhost_task_wake+0x5/0x10
> >   Call Trace:
> >    <TASK>
> >    set_nx_huge_pages+0xcc/0x1e0 [kvm]
>
> Thanks for pointing out this gap. It looks like we'd have to hold the
> kvm_lock in kvm_mmu_post_init_vm(), and add NULL checks in
> set_nx_huge_pages() and set_nx_huge_pages_recovery_param() to prevent
> the NULL deref. Is that okay?

I don't _think_ we need to take kvm_lock. And I don't want to take kvm_lock,
because we're also trying to eliminate a (very theoretical) deadlock[1] due to
taking kvm_lock in the params helpers.

There is a race that can happen with my proposed fix[2], but I'm not sure we
care enough to address it. If kvm_nx_huge_page_recovery_worker() runs before
the params are set, and the param setter processes the VM before
nx_huge_page_recovery_thread is set, then the worker could sleep for too long,
relative to what userspace expects.

I suppose if we care then we could fix that by taking kvm->arch.nx_once.mutex
when waking the task?

[1] https://lore.kernel.org/all/20250124191109.205955-2-pbonzini@redhat.com
[2] https://lore.kernel.org/all/20250124234623.3609069-1-seanjc@google.com
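
The fix at [2] is, roughly, to funnel every wakeup through a helper
that tolerates a VM whose recovery task has not been created yet. A
sketch of the idea (not the patch verbatim):

static void kvm_wake_nx_recovery_thread(struct kvm *kvm)
{
        /*
         * The recovery task is created on demand at the first KVM_RUN
         * and may not exist even though the VM is visible on vm_list.
         * Such a VM cannot have NX huge pages yet, so skipping the
         * wakeup is safe.
         */
        struct vhost_task *nx_thread =
                READ_ONCE(kvm->arch.nx_huge_page_recovery_thread);

        if (nx_thread)
                vhost_task_wake(nx_thread);
}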

On Fri, Jan 24, 2025 at 04:10:45PM -0800, Sean Christopherson wrote:
> On Fri, Jan 24, 2025, Keith Busch wrote:
> > On Fri, Jan 24, 2025 at 12:07:24PM -0800, Sean Christopherson wrote:
> > > This is broken. If the module param is toggled before the first KVM_RUN, KVM
> > > will hit a NULL pointer deref due to trying to start a non-existent vhost task:
> > >
> > >   BUG: kernel NULL pointer dereference, address: 0000000000000040
> > >   #PF: supervisor read access in kernel mode
> > >   #PF: error_code(0x0000) - not-present page
> > >   PGD 0 P4D 0
> > >   Oops: Oops: 0000 [#1] SMP
> > >   CPU: 16 UID: 0 PID: 1190 Comm: bash Not tainted 6.13.0-rc3-9bb02e874121-x86/xen_msr_fixes-vm #2382
> > >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > >   RIP: 0010:vhost_task_wake+0x5/0x10
> > >   Call Trace:
> > >    <TASK>
> > >    set_nx_huge_pages+0xcc/0x1e0 [kvm]
> >
> > Thanks for pointing out this gap. It looks like we'd have to hold the
> > kvm_lock in kvm_mmu_post_init_vm(), and add NULL checks in
> > set_nx_huge_pages() and set_nx_huge_pages_recovery_param() to prevent
> > the NULL deref. Is that okay?
>
> I don't _think_ we need to take kvm_lock. And I don't want to take kvm_lock,
> because we're also trying to eliminate a (very theoretical) deadlock[1] due to
> taking kvm_lock in the params helpers.
>
> There is a race that can happen with my proposed fix[2], but I'm not sure we
> care enough to address it. If kvm_nx_huge_page_recovery_worker() runs before
> the params are set, and the param setter processes the VM before
> nx_huge_page_recovery_thread is set, then the worker could sleep for too long,
> relative to what userspace expects.
>
> I suppose if we care then we could fix that by taking kvm->arch.nx_once.mutex
> when waking the task?

I think we actually can do this without any additional locks. The only
thing we need to ensure is that the vhost task sees the updated
variable, and I think we can achieve that with appropriate memory
barriers on the reads and writes.
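
A sketch of that approach (assuming the creation side publishes the
pointer only after vhost_task_start(), and the param setters go through
a NULL-tolerant helper as in [2]; this is not a posted patch):

        /* Creation side, in kvm_mmu_start_lpage_recovery(): */
        struct vhost_task *nx_thread;

        nx_thread = vhost_task_create(kvm_nx_huge_page_recovery_worker,
                                      kvm_nx_huge_page_recovery_worker_kill,
                                      kvm, "kvm-nx-lpage-recovery");
        if (!nx_thread)
                return;

        vhost_task_start(nx_thread);
        /* Pairs with the acquire load in the param setters. */
        smp_store_release(&kvm->arch.nx_huge_page_recovery_thread, nx_thread);

        /* Param-setter side: */
        struct vhost_task *nx_thread =
                smp_load_acquire(&kvm->arch.nx_huge_page_recovery_thread);

        if (nx_thread)
                vhost_task_wake(nx_thread);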

diff --git a/arch/x86/include/asm/kvm_call_once.h b/arch/x86/include/asm/kvm_call_once.h
new file mode 100644
index 0000000000000..451cc87084aa7
--- /dev/null
+++ b/arch/x86/include/asm/kvm_call_once.h
@@ -0,0 +1,44 @@
+#ifndef _LINUX_CALL_ONCE_H
+#define _LINUX_CALL_ONCE_H
+
+#include <linux/types.h>
+
+#define ONCE_NOT_STARTED 0
+#define ONCE_RUNNING     1
+#define ONCE_COMPLETED   2
+
+struct once {
+        atomic_t state;
+        struct mutex lock;
+};
+
+static inline void __once_init(struct once *once, const char *name,
+                               struct lock_class_key *key)
+{
+        atomic_set(&once->state, ONCE_NOT_STARTED);
+        __mutex_init(&once->lock, name, key);
+}
+
+#define once_init(once)                                         \
+do {                                                            \
+        static struct lock_class_key __key;                     \
+        __once_init((once), #once, &__key);                     \
+} while (0)
+
+static inline void call_once(struct once *once, void (*cb)(struct once *))
+{
+        /* Pairs with atomic_set_release() below. */
+        if (atomic_read_acquire(&once->state) == ONCE_COMPLETED)
+                return;
+
+        guard(mutex)(&once->lock);
+        WARN_ON(atomic_read(&once->state) == ONCE_RUNNING);
+        if (atomic_read(&once->state) != ONCE_NOT_STARTED)
+                return;
+
+        atomic_set(&once->state, ONCE_RUNNING);
+        cb(once);
+        atomic_set_release(&once->state, ONCE_COMPLETED);
+}
+
+#endif /* _LINUX_CALL_ONCE_H */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2f442701dc755..e1eb8155e6a82 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -37,6 +37,7 @@
 #include <asm/kvm_page_track.h>
 #include <asm/kvm_vcpu_regs.h>
 #include <asm/hyperv-tlfs.h>
+#include <asm/kvm_call_once.h>
 #include <asm/reboot.h>
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
@@ -1466,6 +1467,7 @@ struct kvm_arch {
         struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter;
         struct vhost_task *nx_huge_page_recovery_thread;
         u64 nx_huge_page_last;
+        struct once nx_once;
 
 #ifdef CONFIG_X86_64
         /* The number of TDP MMU pages across all roots. */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 26b4ba7e7cb5e..a45ae60e84ab4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7447,20 +7447,28 @@ static bool kvm_nx_huge_page_recovery_worker(void *data)
         return true;
 }
 
-int kvm_mmu_post_init_vm(struct kvm *kvm)
+static void kvm_mmu_start_lpage_recovery(struct once *once)
 {
-        if (nx_hugepage_mitigation_hard_disabled)
-                return 0;
+        struct kvm_arch *ka = container_of(once, struct kvm_arch, nx_once);
+        struct kvm *kvm = container_of(ka, struct kvm, arch);
 
         kvm->arch.nx_huge_page_last = get_jiffies_64();
         kvm->arch.nx_huge_page_recovery_thread = vhost_task_create(
                 kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill,
                 kvm, "kvm-nx-lpage-recovery");
 
+        if (kvm->arch.nx_huge_page_recovery_thread)
+                vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
+}
+
+int kvm_mmu_post_init_vm(struct kvm *kvm)
+{
+        if (nx_hugepage_mitigation_hard_disabled)
+                return 0;
+
+        call_once(&kvm->arch.nx_once, kvm_mmu_start_lpage_recovery);
         if (!kvm->arch.nx_huge_page_recovery_thread)
                 return -ENOMEM;
-
-        vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);
         return 0;
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e248152fa134..6d4a6734b2d69 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11471,6 +11471,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
         struct kvm_run *kvm_run = vcpu->run;
         int r;
 
+        r = kvm_mmu_post_init_vm(vcpu->kvm);
+        if (r)
+                return r;
+
         vcpu_load(vcpu);
         kvm_sigset_activate(vcpu);
         kvm_run->flags = 0;
@@ -12748,7 +12752,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 int kvm_arch_post_init_vm(struct kvm *kvm)
 {
-        return kvm_mmu_post_init_vm(kvm);
+        once_init(&kvm->arch.nx_once);
+        return 0;
 }
 
 static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
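
The call_once() in the new header is a double-checked state machine: an
acquire load serves the lock-free fast path, and the mutex serializes
only first-time callers. A userspace analogue using C11 atomics and a
pthread mutex behaves the same way (illustrative only; the kernel
version uses atomic_t, guard(mutex), and lockdep-aware mutex init):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum { ONCE_NOT_STARTED, ONCE_RUNNING, ONCE_COMPLETED };

struct demo_once {
        atomic_int state;
        pthread_mutex_t lock;
};

#define DEMO_ONCE_INIT { ONCE_NOT_STARTED, PTHREAD_MUTEX_INITIALIZER }

static void demo_call_once(struct demo_once *once,
                           void (*cb)(struct demo_once *))
{
        /* Fast path: pairs with the release store below, so a caller
         * that sees ONCE_COMPLETED also sees all of cb()'s writes. */
        if (atomic_load_explicit(&once->state, memory_order_acquire) ==
            ONCE_COMPLETED)
                return;

        pthread_mutex_lock(&once->lock);
        if (atomic_load_explicit(&once->state, memory_order_relaxed) ==
            ONCE_NOT_STARTED) {
                atomic_store_explicit(&once->state, ONCE_RUNNING,
                                      memory_order_relaxed);
                cb(once);
                atomic_store_explicit(&once->state, ONCE_COMPLETED,
                                      memory_order_release);
        }
        pthread_mutex_unlock(&once->lock);
}

static void start_worker(struct demo_once *once)
{
        (void)once;
        printf("worker started exactly once\n");
}

int main(void)
{
        static struct demo_once once = DEMO_ONCE_INIT;

        /* Any number of threads may race here; start_worker() runs
         * exactly once, and later callers take only the fast path. */
        demo_call_once(&once, start_worker);
        demo_call_once(&once, start_worker);
        return 0;
}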