
[1/2] KVM: x86: Fix deadlock in kvm_vm_ioctl_set_msr_filter()

Message ID 20221229211737.138861-2-mhal@rbox.co (mailing list archive)
State New, archived
Series Fix deadlocks in kvm_vm_ioctl_set_msr_filter() and …

Commit Message

Michal Luczaj Dec. 29, 2022, 9:17 p.m. UTC
Move synchronize_srcu(&kvm->srcu) out of kvm->lock critical section.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 arch/x86/kvm/x86.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

Comments

Sean Christopherson Jan. 3, 2023, 5:17 p.m. UTC | #1
On Thu, Dec 29, 2022, Michal Luczaj wrote:
> Move synchronize_srcu(&kvm->srcu) out of kvm->lock critical section.

This needs a much more descriptive changelog, and an update to
Documentation/virt/kvm/locking.rst to define the ordering requirements between
kvm->srcu and kvm->lock.  And IIUC, there is no deadlock in the current code
base, so this really should be a prep patch that's sent along with the Xen series[*]
that wants to take kvm->srcu outside of kvm->lock.

[*] https://lore.kernel.org/all/20221222203021.1944101-2-mhal@rbox.co
Sean Christopherson Jan. 3, 2023, 5:28 p.m. UTC | #2
On Tue, Jan 03, 2023, Sean Christopherson wrote:
> On Thu, Dec 29, 2022, Michal Luczaj wrote:
> > Move synchronize_srcu(&kvm->srcu) out of kvm->lock critical section.
> 
> This needs a much more descriptive changelog, and an update to
> Documentation/virt/kvm/locking.rst to define the ordering requirements between
> kvm->srcu and kvm->lock.

Ah, Paolo already sent a docs update.  I'll respond to that patch; I don't think
the assertion that "kvm->lock is taken inside kvm->srcu" is true prior to the Xen
change.

https://lore.kernel.org/all/20221228110410.1682852-2-pbonzini@redhat.com
Michal Luczaj Jan. 5, 2023, 7:32 p.m. UTC | #3
On 1/3/23 18:17, Sean Christopherson wrote:
> On Thu, Dec 29, 2022, Michal Luczaj wrote:
>> Move synchronize_srcu(&kvm->srcu) out of kvm->lock critical section.
> 
> This needs a much more descriptive changelog, and an update to
> Documentation/virt/kvm/locking.rst to define the ordering requirements between
> kvm->srcu and kvm->lock.  And IIUC, there is no deadlock in the current code
> base, so this really should be a prep patch that's sent along with the Xen series[*]
> that wants to take kvm->srcu outside of kvm->lock.
> 
> [*] https://lore.kernel.org/all/20221222203021.1944101-2-mhal@rbox.co

I'd be happy to provide a more descriptive changelog, but right now I'm a
bit confused. I'd be really grateful for some clarifications:

I'm not sure how to understand "no deadlock in the current code base". I've
run the selftests[1] under the up-to-date mainline/master and I do see the
deadlocks. Is there a branch where kvm_xen_set_evtchn() is not taking
kvm->lock while inside kvm->srcu?

Also, is there a consensus as to the lock ordering? IOW, is the state of
virt/kvm/locking.rst up to date, regardless of the discussion going on[2]?

[1] https://lore.kernel.org/kvm/15599980-bd2e-b6c2-1479-e1eef02da0b5@rbox.co/
[2] https://lore.kernel.org/kvm/Y7RpB+trpnhVRhQW@google.com/

thanks,
Michal
Sean Christopherson Jan. 5, 2023, 10:23 p.m. UTC | #4
On Thu, Jan 05, 2023, Michal Luczaj wrote:
> On 1/3/23 18:17, Sean Christopherson wrote:
> > On Thu, Dec 29, 2022, Michal Luczaj wrote:
> >> Move synchronize_srcu(&kvm->srcu) out of kvm->lock critical section.
> > 
> > This needs a much more descriptive changelog, and an update to
> > Documentation/virt/kvm/locking.rst to define the ordering requirements between
> > kvm->srcu and kvm->lock.  And IIUC, there is no deadlock in the current code
> > base, so this really should be a prep patch that's sent along with the Xen series[*]
> > that wants to take kvm->srcu outside of kvm->lock.
> > 
> > [*] https://lore.kernel.org/all/20221222203021.1944101-2-mhal@rbox.co
> 
> I'd be happy to provide a more descriptive changelog, but right now I'm a
> bit confused. I'd be really grateful for some clarifications:
> 
> I'm not sure how to understand "no deadlock in the current code base". I've
> run the selftests[1] under the up-to-date mainline/master and I do see the
> deadlocks. Is there a branch where kvm_xen_set_evtchn() is not taking
> kvm->lock while inside kvm->srcu?

Ah, no, I'm the one that's confused; I saw an earlier patch touch SRCU stuff and
assumed it introduced the deadlock.  Actually, it's the KVM Xen code that's confused.

This comment in kvm_xen_set_evtchn() is a tragicomedy.  It explicitly calls out
the exact case that would be problematic (Xen hypercall), but commit 2fd6df2f2b47
("KVM: x86/xen: intercept EVTCHNOP_send from guests") ran right past that.

	/*
	 * For the irqfd workqueue, using the main kvm->lock mutex is
	 * fine since this function is invoked from kvm_set_irq() with
	 * no other lock held, no srcu. In future if it will be called
	 * directly from a vCPU thread (e.g. on hypercall for an IPI)
	 * then it may need to switch to using a leaf-node mutex for
	 * serializing the shared_info mapping.
	 */
	mutex_lock(&kvm->lock);

> Also, is there a consensus as to the lock ordering?  IOW, is the state of
> virt/kvm/locking.rst up to date, regardless of the discussion going on[2]?

I'm not convinced that allowing kvm->lock to be taken while holding kvm->srcu is
a good idea.  Requiring kvm->lock to be dropped before doing synchronize_srcu()
isn't problematic, and arguably it's a good rule since holding kvm->lock for
longer than necessary is undesirable.  What I don't like is taking kvm->lock inside
kvm->srcu.  It's not documented, but in pretty much every other case except Xen,
sleepable locks are taken outside of kvm->srcu, e.g. vcpu->mutex, slots_lock, and
quite often kvm->lock itself.

Ha!  Case in point.  The aforementioned Xen code blatantly violates KVM's locking
rules:

  - kvm->lock is taken outside vcpu->mutex

In the kvm_xen_hypercall() case, vcpu->mutex is held (KVM_RUN) when kvm_xen_set_evtchn()
is called, i.e. takes kvm->lock inside vcpu->mutex.  It doesn't cause explosions
because KVM x86 only takes vcpu->mutex inside kvm->lock for SEV, and no one runs
Xen+SEV guests, but the Xen code is still a trainwreck waiting to happen.
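
Roughly, the offending path (sketching from memory, so take the exact call
chain with a grain of salt) is:

  kvm_vcpu_ioctl(KVM_RUN)
    mutex_lock(&vcpu->mutex)
    ...
      kvm_xen_hypercall()
        kvm_xen_hcall_evtchn_send()
          kvm_xen_set_evtchn()
            mutex_lock(&kvm->lock)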

In other words, I'm fine with this patch for optimization purposes, but I don't
think we should call it a bug fix.  commit 2fd6df2f2b47 ("KVM: x86/xen: intercept
EVTCHNOP_send from guests") is the one that's wrong and needs fixing.
Sean Christopherson Jan. 5, 2023, 10:46 p.m. UTC | #5
On Thu, Dec 29, 2022, Michal Luczaj wrote:
> Move synchronize_srcu(&kvm->srcu) out of kvm->lock critical section.
> 
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
>  arch/x86/kvm/x86.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index da4bbd043a7b..16c89f7e98c3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6460,7 +6460,7 @@ static int kvm_vm_ioctl_set_msr_filter(struct kvm *kvm,
>  	struct kvm_x86_msr_filter *new_filter, *old_filter;
>  	bool default_allow;
>  	bool empty = true;
> -	int r = 0;
> +	int r;

Separate change that should be its own patch (even though it's trivial).

>  	u32 i;
>  
>  	if (filter->flags & ~KVM_MSR_FILTER_VALID_MASK)
> @@ -6488,16 +6488,14 @@ static int kvm_vm_ioctl_set_msr_filter(struct kvm *kvm,
>  	mutex_lock(&kvm->lock);
>  
>  	/* The per-VM filter is protected by kvm->lock... */
> -	old_filter = srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1);
> +	old_filter = rcu_replace_pointer(kvm->arch.msr_filter, new_filter, 1);

Same here.  Can you also add a patch to replace the '1' with "mutex_is_locked(&kvm->lock)"?
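
E.g. something like this (untested):

	old_filter = rcu_replace_pointer(kvm->arch.msr_filter, new_filter,
					 mutex_is_locked(&kvm->lock));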

> +	kvm_make_all_cpus_request(kvm, KVM_REQ_MSR_FILTER_CHANGED);

I think the request can actually be moved outside of kvm->lock too (in yet another
patch).  Iterating over vCPUs without kvm->lock is safe; kvm_make_all_cpus_request()
might race with adding a new vCPU, i.e. send an unnecessary request, but
kvm->online_vcpus is never decremented.
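
I.e. the tail of the ioctl would end up looking very roughly like this
(untested sketch):

	mutex_lock(&kvm->lock);

	/* The per-VM filter is protected by kvm->lock... */
	old_filter = rcu_replace_pointer(kvm->arch.msr_filter, new_filter,
					 mutex_is_locked(&kvm->lock));

	mutex_unlock(&kvm->lock);

	/* Neither the request nor the SRCU sync needs kvm->lock. */
	kvm_make_all_cpus_request(kvm, KVM_REQ_MSR_FILTER_CHANGED);
	synchronize_srcu(&kvm->srcu);
	kvm_free_msr_filter(old_filter);

	return 0;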

> -	rcu_assign_pointer(kvm->arch.msr_filter, new_filter);
> -	synchronize_srcu(&kvm->srcu);
> +	mutex_unlock(&kvm->lock);
>  
> +	synchronize_srcu(&kvm->srcu);
>  	kvm_free_msr_filter(old_filter);
>  
> -	kvm_make_all_cpus_request(kvm, KVM_REQ_MSR_FILTER_CHANGED);
> -	mutex_unlock(&kvm->lock);
> -
>  	return 0;
>  }
>  
> -- 
> 2.39.0
>
Paolo Bonzini Jan. 5, 2023, 11:02 p.m. UTC | #6
On 1/5/23 23:23, Sean Christopherson wrote:
> Ha!  Case in point.  The aforementioned Xen code blatantly violates KVM's locking
> rules:
> 
>    - kvm->lock is taken outside vcpu->mutex

Ouch yeah, that's not salvageable.  Anything that takes kvm->lock inside 
kvm->srcu transitively has to be taking kvm->lock inside vcpu->mutex as 
well.

In abstract I don't think that "vcpu->mutex inside kvm->lock" would be a 
particularly problematic rule; kvm->lock critical sections are much 
shorter than vcpu->mutex which covers all of KVM_RUN for example, and 
that hints at making vcpu->mutex the *outer* mutex.  However, I 
completely forgot the sev_lock_vcpus_for_migration case, which is the 
exception that... well, disproves the rule.

Fortunately, it's pretty easy to introduce a new lock just for xen.c and 
revert the docs patch.
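
Something along these lines, i.e. a leaf mutex owned entirely by xen.c
(untested sketch, the field name is invented here):

	/* in struct kvm_xen */
	struct mutex xen_lock;

	/* and in xen.c, wherever kvm->lock is currently taken only to
	 * serialize the shared_info/evtchn state:
	 */
	mutex_lock(&kvm->arch.xen.xen_lock);
	...
	mutex_unlock(&kvm->arch.xen.xen_lock);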

Paolo
Sean Christopherson Jan. 5, 2023, 11:07 p.m. UTC | #7
On Fri, Jan 06, 2023, Paolo Bonzini wrote:
> On 1/5/23 23:23, Sean Christopherson wrote:
> > Ha!  Case in point.  The aforementioned Xen code blatantly violates KVM's locking
> > rules:
> > 
> >    - kvm->lock is taken outside vcpu->mutex
> 
> Ouch yeah, that's not salvageable.  Anything that takes kvm->lock inside
> kvm->srcu transitively has to be taking kvm->lock inside vcpu->mutex as
> well.
> 
> In abstract I don't think that "vcpu->mutex inside kvm->lock" would be a
> particularly problematic rule; kvm->lock critical sections are much shorter
> than vcpu->mutex which covers all of KVM_RUN for example, and that hints at
> making vcpu->mutex the *outer* mutex.  However, I completely forgot the
> sev_lock_vcpus_for_migration case, which is the exception that... well,
> disproves the rule.

Ya, and there are plenty more instances outside of x86.

ARM's vGIC stuff also does similar things, see lock_all_vcpus().

PPC's kvmppc_xive_release() and kvmppc_xics_release().

s390's kvm_s390_cpus_from_pv().
David Woodhouse Jan. 6, 2023, 10:06 a.m. UTC | #8
On Thu, 2023-01-05 at 22:23 +0000, Sean Christopherson wrote:
> This comment in kvm_xen_set_evtchn() is a tragicomedy.  It explicitly calls out
> the exact case that would be problematic (Xen hypercall), but commit 2fd6df2f2b47
> ("KVM: x86/xen: intercept EVTCHNOP_send from guests") ran right past that.
> 
>         /*
>          * For the irqfd workqueue, using the main kvm->lock mutex is
>          * fine since this function is invoked from kvm_set_irq() with
>          * no other lock held, no srcu. In future if it will be called
>          * directly from a vCPU thread (e.g. on hypercall for an IPI)
>          * then it may need to switch to using a leaf-node mutex for
>          * serializing the shared_info mapping.
>          */
>         mutex_lock(&kvm->lock);
> 

Yeah, turns out I knew that once. Sometimes I'm cleverer than other
times. I try to leave helpful notes for the other one, but he doesn't
always read them...
Michal Luczaj Jan. 7, 2023, 12:06 a.m. UTC | #9
On 1/5/23 23:23, Sean Christopherson wrote:
> Ha!  Case in point.  The aforementioned Xen code blatantly violates KVM's locking
> rules:
> 
>   - kvm->lock is taken outside vcpu->mutex
> 

FWIW, I guess this looks like a violation of the same sort:

kvm_vcpu_ioctl()
  mutex_lock_killable(&vcpu->mutex)
  kvm_arch_vcpu_ioctl()
    kvm_xen_vcpu_get_attr() / kvm_xen_vcpu_set_attr()
      mutex_lock(&vcpu->kvm->lock)

> In other words, I'm fine with this patch for optimization purposes, but I don't
> think we should call it a bug fix. (...)

Sure, resending as such, along with minor fixes.

thanks,
Michal
David Woodhouse Jan. 10, 2023, 12:55 p.m. UTC | #10
On Fri, 2023-01-06 at 00:02 +0100, Paolo Bonzini wrote:
> On 1/5/23 23:23, Sean Christopherson wrote:
> > Ha!  Case in point.  The aforementioned Xen code blatantly violates KVM's locking
> > rules:
> > 
> >    - kvm->lock is taken outside vcpu->mutex
> 
> Ouch yeah, that's not salvageable.  Anything that takes kvm->lock inside 
> kvm->srcu transitively has to be taking kvm->lock inside vcpu->mutex as 
> well.
> 
> In abstract I don't think that "vcpu->mutex inside kvm->lock" would be a 
> particularly problematic rule; kvm->lock critical sections are much 
> shorter than vcpu->mutex which covers all of KVM_RUN for example, and
> that hints at making vcpu->mutex the *outer* mutex.  However, I 
> completely forgot the sev_lock_vcpus_for_migration case, which is the
> exception that... well, disproves the rule.
> 

But because it's an exception and rarely happens in practice, lockdep
didn't notice and keep me honest sooner? Can we take them in that order
just for fun at startup, to make sure lockdep knows?

> Fortunately, it's pretty easy to introduce a new lock just for xen.c and 
> revert the docs patch.

The wording of that made me hold off, on the expectation that if I did
it myself, you'd probably beat me to it with a patch. But I don't see
one yet. Shall I?
Paolo Bonzini Jan. 10, 2023, 2:10 p.m. UTC | #11
On 1/10/23 13:55, David Woodhouse wrote:
>> However, I
>> completely forgot the sev_lock_vcpus_for_migration case, which is the
>> exception that... well, disproves the rule.
>>
> But because it's an exception and rarely happens in practice, lockdep
> didn't notice and keep me honest sooner? Can we take them in that order
> just for fun at startup, to make sure lockdep knows?

Sure, why not.  Out of curiosity, is this kind of "priming" a thing 
elsewhere in the kernel?

>> Fortunately, it's pretty easy to introduce a new lock just for xen.c and
>> revert the docs patch.
> The wording of that made me hold off, on the expectation that if I did
> it myself, you'd probably beat me to it with a patch. But I don't see
> one yet. Shall I?

No, I have already written it but didn't send it because I wanted to 
test it on the real thing using your QEMU patches. :)  But that was a 
rabbit hole of its own, my Xen knowledge is somewhat outdated.

Paolo
David Woodhouse Jan. 10, 2023, 3:27 p.m. UTC | #12
On Tue, 2023-01-10 at 15:10 +0100, Paolo Bonzini wrote:
> On 1/10/23 13:55, David Woodhouse wrote:
> > > However, I
> > > completely forgot the sev_lock_vcpus_for_migration case, which is the
> > > exception that... well, disproves the rule.
> > > 
> > But because it's an exception and rarely happens in practice, lockdep
> > didn't notice and keep me honest sooner? Can we take them in that order
> > just for fun at startup, to make sure lockdep knows?
> 
> Sure, why not.  Out of curiosity, is this kind of "priming" a thing 
> elsewhere in the kernel?

Dunno. Other people might know how to tell lockdep about it explicitly
instead of purely empirically :)

> > > Fortunately, it's pretty easy to introduce a new lock just for xen.c and
> > > revert the docs patch.
> >
> > The wording of that made me hold off, on the expectation that if I did
> > it myself, you'd probably beat me to it with a patch. But I don't see
> > one yet. Shall I?
> 
> No, I have already written it but didn't send it because I wanted to 
> test it on the real thing using your QEMU patches. :)  But that was a
> rabbit hole of its own, my Xen knowledge is somewhat outdated.

The self-tests are much more likely to show it up than real usage,
where all this stuff is fairly pedestrian. But that qemu branch should
run fairly easily with any standard distro kernel and the command line
I gave in e.g.
https://lore.kernel.org/qemu-devel/20230110122042.1562155-1-dwmw2@infradead.org/

The trick is to pass it a SATA disk, because there's no "unplug" for
those. And until we get the Xen PV disk back end working, you don't
*want* to unplug the emulated one :)
David Woodhouse Jan. 10, 2023, 7:17 p.m. UTC | #13
On Tue, 2023-01-10 at 15:10 +0100, Paolo Bonzini wrote:
> On 1/10/23 13:55, David Woodhouse wrote:
> > > However, I
> > > completely forgot the sev_lock_vcpus_for_migration case, which is the
> > > exception that... well, disproves the rule.
> > > 
> > But because it's an exception and rarely happens in practice, lockdep
> > didn't notice and keep me honest sooner? Can we take them in that order
> > just for fun at startup, to make sure lockdep knows?
> 
> Sure, why not.  Out of curiosity, is this kind of "priming" a thing 
> elsewhere in the kernel

I did this:

--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -461,6 +461,11 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
 static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 {
        mutex_init(&vcpu->mutex);
+
+       /* Ensure that lockdep knows vcpu->mutex is taken *inside* kvm->lock */
+       mutex_lock(&vcpu->mutex);
+       mutex_unlock(&vcpu->mutex);
+
        vcpu->cpu = -1;
        vcpu->kvm = kvm;
        vcpu->vcpu_id = id;


What I got when I ran xen_shinfo_test was... not what I expected:


[13890.148203] ======================================================
[13890.148205] WARNING: possible circular locking dependency detected
[13890.148207] 6.1.0-rc4+ #1024 Tainted: G          I E     
[13890.148209] ------------------------------------------------------
[13890.148210] xen_shinfo_test/13326 is trying to acquire lock:
[13890.148212] ffff888107d493b0 (&gpc->lock){....}-{2:2}, at: kvm_xen_update_runstate_guest+0xf2/0x4e0 [kvm]
[13890.148285] 
               but task is already holding lock:
[13890.148287] ffff88887f671718 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x84/0x7c0
[13890.148295] 
               which lock already depends on the new lock.

[13890.148296] 
               the existing dependency chain (in reverse order) is:
[13890.148298] 
               -> #4 (&rq->__lock){-.-.}-{2:2}:
[13890.148301]        __lock_acquire+0x4b4/0x940
[13890.148306]        lock_acquire.part.0+0xa8/0x210
[13890.148309]        _raw_spin_lock_nested+0x35/0x50
[13890.148313]        raw_spin_rq_lock_nested+0x23/0x30
[13890.148318]        task_fork_fair+0x45/0x170
[13890.148322]        sched_cgroup_fork+0x11a/0x160
[13890.148325]        copy_process+0x1139/0x1950
[13890.148329]        kernel_clone+0x9b/0x390
[13890.148332]        user_mode_thread+0x5b/0x80
[13890.148335]        rest_init+0x1e/0x170
[13890.148338]        arch_call_rest_init+0xa/0x14
[13890.148342]        start_kernel+0x647/0x670
[13890.148345]        secondary_startup_64_no_verify+0xd3/0xdb
[13890.148349] 
               -> #3 (&p->pi_lock){-.-.}-{2:2}:
[13890.148352]        __lock_acquire+0x4b4/0x940
[13890.148355]        lock_acquire.part.0+0xa8/0x210
[13890.148357]        __raw_spin_lock_irqsave+0x44/0x60
[13890.148360]        try_to_wake_up+0x69/0x360
[13890.148362]        create_worker+0x129/0x1a0
[13890.148366]        workqueue_init+0x14b/0x1b0
[13890.148371]        kernel_init_freeable+0x95/0x122
[13890.148373]        kernel_init+0x16/0x130
[13890.148375]        ret_from_fork+0x22/0x30
[13890.148378] 
               -> #2 (&pool->lock){-.-.}-{2:2}:
[13890.148381]        __lock_acquire+0x4b4/0x940
[13890.148384]        lock_acquire.part.0+0xa8/0x210
[13890.148386]        _raw_spin_lock+0x2f/0x40
[13890.148389]        __queue_work+0x1a1/0x490
[13890.148391]        queue_work_on+0x75/0x80
[13890.148394]        percpu_ref_put_many.constprop.0+0xea/0xf0
[13890.148398]        __mem_cgroup_uncharge_list+0x7d/0xa0
[13890.148401]        release_pages+0x15b/0x590
[13890.148404]        folio_batch_move_lru+0xd3/0x150
[13890.148407]        lru_add_drain_cpu+0x1ce/0x270
[13890.148410]        lru_add_drain+0x77/0x140
[13890.148413]        do_wp_page+0x342/0x3a0
[13890.148417]        __handle_mm_fault+0x3a1/0x690
[13890.148421]        handle_mm_fault+0x113/0x3b0
[13890.148424]        do_user_addr_fault+0x1d8/0x6b0
[13890.148427]        exc_page_fault+0x6a/0xe0
[13890.148429]        asm_exc_page_fault+0x22/0x30
[13890.148432] 
               -> #1 (lock#4){+.+.}-{2:2}:
[13890.148436]        __lock_acquire+0x4b4/0x940
[13890.148439]        lock_acquire.part.0+0xa8/0x210
[13890.148441]        folio_mark_accessed+0x8d/0x1a0
[13890.148444]        kvm_release_page_clean+0x89/0xb0 [kvm]
[13890.148485]        hva_to_pfn_retry+0x296/0x2d0 [kvm]
[13890.148524]        __kvm_gpc_refresh+0x18e/0x310 [kvm]
[13890.148562]        kvm_xen_hvm_set_attr+0x1f5/0x2f0 [kvm]
[13890.148613]        kvm_arch_vm_ioctl+0x9bf/0xd50 [kvm]
[13890.148656]        kvm_vm_ioctl+0x5c1/0x7f0 [kvm]
[13890.148693]        __x64_sys_ioctl+0x8a/0xc0
[13890.148696]        do_syscall_64+0x3b/0x90
[13890.148701]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
[13890.148704] 
               -> #0 (&gpc->lock){....}-{2:2}:
[13890.148708]        check_prev_add+0x8f/0xc20
[13890.148710]        validate_chain+0x3ba/0x450
[13890.148713]        __lock_acquire+0x4b4/0x940
[13890.148715]        lock_acquire.part.0+0xa8/0x210
[13890.148717]        __raw_read_lock_irqsave+0x7f/0xa0
[13890.148720]        kvm_xen_update_runstate_guest+0xf2/0x4e0 [kvm]
[13890.148771]        kvm_arch_vcpu_put+0x1d4/0x250 [kvm]
[13890.148814]        kvm_sched_out+0x2f/0x50 [kvm]
[13890.148849]        prepare_task_switch+0xe7/0x3b0
[13890.148853]        __schedule+0x1c9/0x7c0
[13890.148857]        schedule+0x5d/0xd0
[13890.148860]        xfer_to_guest_mode_handle_work+0x59/0xd0
[13890.148865]        vcpu_run+0x328/0x410 [kvm]
[13890.148908]        kvm_arch_vcpu_ioctl_run+0x1cd/0x640 [kvm]
[13890.148950]        kvm_vcpu_ioctl+0x279/0x700 [kvm]
[13890.148986]        __x64_sys_ioctl+0x8a/0xc0
[13890.148989]        do_syscall_64+0x3b/0x90
[13890.148993]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
[13890.148996] 
               other info that might help us debug this:

[13890.148997] Chain exists of:
                 &gpc->lock --> &p->pi_lock --> &rq->__lock

[13890.149002]  Possible unsafe locking scenario:

[13890.149003]        CPU0                    CPU1
[13890.149004]        ----                    ----
[13890.149005]   lock(&rq->__lock);
[13890.149007]                                lock(&p->pi_lock);
[13890.149009]                                lock(&rq->__lock);
[13890.149011]   lock(&gpc->lock);
[13890.149013] 
                *** DEADLOCK ***

[13890.149014] 3 locks held by xen_shinfo_test/13326:
[13890.149016]  #0: ffff888107d480b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x77/0x700 [kvm]
[13890.149057]  #1: ffff88887f671718 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x84/0x7c0
[13890.149064]  #2: ffffc900017c5860 (&kvm->srcu){....}-{0:0}, at: kvm_arch_vcpu_put+0x2a/0x250 [kvm]
[13890.149109] 
               stack backtrace:
[13890.149111] CPU: 1 PID: 13326 Comm: xen_shinfo_test Tainted: G          I E      6.1.0-rc4+ #1024
[13890.149115] Hardware name: Intel Corporation S2600CW/S2600CW, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[13890.149116] Call Trace:
[13890.149118]  <TASK>
[13890.149121]  dump_stack_lvl+0x56/0x73
[13890.149126]  check_noncircular+0x102/0x120
[13890.149131]  check_prev_add+0x8f/0xc20
[13890.149134]  ? validate_chain+0x22a/0x450
[13890.149136]  ? add_chain_cache+0x10b/0x2d0
[13890.149140]  validate_chain+0x3ba/0x450
[13890.149144]  __lock_acquire+0x4b4/0x940
[13890.149148]  lock_acquire.part.0+0xa8/0x210
[13890.149151]  ? kvm_xen_update_runstate_guest+0xf2/0x4e0 [kvm]
[13890.149204]  ? rcu_read_lock_sched_held+0x43/0x70
[13890.149208]  ? lock_acquire+0x102/0x140
[13890.149211]  __raw_read_lock_irqsave+0x7f/0xa0
[13890.149215]  ? kvm_xen_update_runstate_guest+0xf2/0x4e0 [kvm]
[13890.149266]  kvm_xen_update_runstate_guest+0xf2/0x4e0 [kvm]
[13890.149316]  ? get_kvmclock_ns+0x52/0x90 [kvm]
[13890.149359]  ? lock_acquire+0x102/0x140
[13890.149363]  kvm_arch_vcpu_put+0x1d4/0x250 [kvm]
[13890.149407]  kvm_sched_out+0x2f/0x50 [kvm]
[13890.149444]  prepare_task_switch+0xe7/0x3b0
[13890.149449]  __schedule+0x1c9/0x7c0
[13890.149454]  schedule+0x5d/0xd0
[13890.149458]  xfer_to_guest_mode_handle_work+0x59/0xd0
[13890.149463]  vcpu_run+0x328/0x410 [kvm]
[13890.149507]  kvm_arch_vcpu_ioctl_run+0x1cd/0x640 [kvm]
[13890.149551]  kvm_vcpu_ioctl+0x279/0x700 [kvm]
[13890.149588]  ? exc_page_fault+0xdb/0xe0
[13890.149591]  ? _raw_spin_unlock_irq+0x34/0x50
[13890.149595]  ? do_setitimer+0x190/0x1e0
[13890.149600]  __x64_sys_ioctl+0x8a/0xc0
[13890.149604]  do_syscall_64+0x3b/0x90
[13890.149607]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[13890.149611] RIP: 0033:0x7fa394a3fd1b
[13890.149614] Code: 73 01 c3 48 8b 0d 05 a1 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 a0 1b 00 f7 d8 64 89 01 48
[13890.149617] RSP: 002b:00007ffe7f86c0a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[13890.149620] RAX: ffffffffffffffda RBX: 00007fa394e01000 RCX: 00007fa394a3fd1b
[13890.149622] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000007
[13890.149624] RBP: 00007fa394dc96c0 R08: 000000000041827e R09: 0000000000418234
[13890.149626] R10: 00007fa394bb936b R11: 0000000000000246 R12: 00000000018f9800
[13890.149628] R13: 000000000000000a R14: 00007fa394dffff1 R15: 00000000018f72a0
[13890.149632]  </TASK>
Sean Christopherson Jan. 10, 2023, 7:37 p.m. UTC | #14
On Tue, Jan 10, 2023, David Woodhouse wrote:
> On Tue, 2023-01-10 at 15:10 +0100, Paolo Bonzini wrote:
> > On 1/10/23 13:55, David Woodhouse wrote:
> > > > However, I
> > > > completely forgot the sev_lock_vcpus_for_migration case, which is the
> > > > exception that... well, disproves the rule.
> > > > 
> > > But because it's an exception and rarely happens in practice, lockdep
> > > didn't notice and keep me honest sooner? Can we take them in that order
> > > just for fun at startup, to make sure lockdep knows?
> > 
> > Sure, why not.  Out of curiosity, is this kind of "priming" a thing 
> > elsewhere in the kernel
> 
> I did this:
> 
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -461,6 +461,11 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
>  static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  {
>         mutex_init(&vcpu->mutex);
> +
> +       /* Ensure that lockdep knows vcpu->mutex is taken *inside* kvm->lock */
> +       mutex_lock(&vcpu->mutex);
> +       mutex_unlock(&vcpu->mutex);

No idea about the splat below, but kvm_vcpu_init() doesn't run under kvm->lock,
so I wouldn't expect this to do anything.
David Woodhouse Jan. 10, 2023, 7:46 p.m. UTC | #15
On Tue, 2023-01-10 at 19:37 +0000, Sean Christopherson wrote:
> No idea about the splat below, but kvm_vcpu_init() doesn't run under kvm->lock,
> so I wouldn't expect this to do anything.

Ah, good point. I think I misread kvm_vm_ioctl_create_vcpu() *dropping*
kvm->lock a few lines above calling kvm_vcpu_init().

This one gives me the splat I was *expecting*. But Paolo said he had a
patch for that and now the other one is *much* more interesting...


--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3924,6 +3924,11 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
        }
 
        mutex_lock(&kvm->lock);
+
+       /* Ensure that lockdep knows vcpu->mutex is taken *inside* kvm->lock */
+       mutex_lock(&vcpu->mutex);
+       mutex_unlock(&vcpu->mutex);
+
        if (kvm_get_vcpu_by_id(kvm, id)) {
                r = -EEXIST;
                goto unlock_vcpu_destroy;


[  111.042398] ======================================================
[  111.042400] WARNING: possible circular locking dependency detected
[  111.042402] 6.1.0-rc4+ #1024 Tainted: G          I E     
[  111.042405] ------------------------------------------------------
[  111.042406] xen_shinfo_test/11035 is trying to acquire lock:
[  111.042409] ffffc900017803c0 (&kvm->lock){+.+.}-{3:3}, at: kvm_xen_vcpu_set_attr+0x2a/0x4f0 [kvm]
[  111.042494] 
               but task is already holding lock:
[  111.042496] ffff88828f9600b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x77/0x700 [kvm]
[  111.042547] 
               which lock already depends on the new lock.

[  111.042548] 
               the existing dependency chain (in reverse order) is:
[  111.042550] 
               -> #1 (&vcpu->mutex){+.+.}-{3:3}:
[  111.042554]        __lock_acquire+0x4b4/0x940
[  111.042561]        lock_acquire.part.0+0xa8/0x210
[  111.042564]        __mutex_lock+0x94/0x920
[  111.042571]        kvm_vm_ioctl_create_vcpu+0x1c1/0x4b0 [kvm]
[  111.042618]        kvm_vm_ioctl+0x565/0x7f0 [kvm]
[  111.042664]        __x64_sys_ioctl+0x8a/0xc0
[  111.042669]        do_syscall_64+0x3b/0x90
[  111.042674]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  111.042679] 
               -> #0 (&kvm->lock){+.+.}-{3:3}:
[  111.042683]        check_prev_add+0x8f/0xc20
[  111.042687]        validate_chain+0x3ba/0x450
[  111.042690]        __lock_acquire+0x4b4/0x940
[  111.042693]        lock_acquire.part.0+0xa8/0x210
[  111.042696]        __mutex_lock+0x94/0x920
[  111.042700]        kvm_xen_vcpu_set_attr+0x2a/0x4f0 [kvm]
[  111.042764]        kvm_arch_vcpu_ioctl+0x817/0x12a0 [kvm]
[  111.042817]        kvm_vcpu_ioctl+0x519/0x700 [kvm]
[  111.042862]        __x64_sys_ioctl+0x8a/0xc0
[  111.042865]        do_syscall_64+0x3b/0x90
[  111.042869]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  111.042873] 
               other info that might help us debug this:

[  111.042875]  Possible unsafe locking scenario:

[  111.042876]        CPU0                    CPU1
[  111.042878]        ----                    ----
[  111.042879]   lock(&vcpu->mutex);
[  111.042882]                                lock(&kvm->lock);
[  111.042884]                                lock(&vcpu->mutex);
[  111.042887]   lock(&kvm->lock);
[  111.042889] 
                *** DEADLOCK ***

[  111.042891] 1 lock held by xen_shinfo_test/11035:
[  111.042893]  #0: ffff88828f9600b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x77/0x700 [kvm]
[  111.042943] 
               stack backtrace:
[  111.042946] CPU: 23 PID: 11035 Comm: xen_shinfo_test Tainted: G          I E      6.1.0-rc4+ #1024
[  111.042949] Hardware name: Intel Corporation S2600CW/S2600CW, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[  111.042952] Call Trace:
[  111.042954]  <TASK>
[  111.042957]  dump_stack_lvl+0x56/0x73
[  111.042963]  check_noncircular+0x102/0x120
[  111.042970]  check_prev_add+0x8f/0xc20
[  111.042973]  ? add_chain_cache+0x10b/0x2d0
[  111.042976]  ? _raw_spin_unlock_irqrestore+0x2d/0x60
[  111.042982]  validate_chain+0x3ba/0x450
[  111.042986]  __lock_acquire+0x4b4/0x940
[  111.042991]  lock_acquire.part.0+0xa8/0x210
[  111.042995]  ? kvm_xen_vcpu_set_attr+0x2a/0x4f0 [kvm]
[  111.043061]  ? rcu_read_lock_sched_held+0x43/0x70
[  111.043067]  ? lock_acquire+0x102/0x140
[  111.043072]  __mutex_lock+0x94/0x920
[  111.043077]  ? kvm_xen_vcpu_set_attr+0x2a/0x4f0 [kvm]
[  111.043141]  ? __lock_acquire+0x4b4/0x940
[  111.043145]  ? kvm_xen_vcpu_set_attr+0x2a/0x4f0 [kvm]
[  111.043209]  ? kvm_xen_vcpu_set_attr+0x2a/0x4f0 [kvm]
[  111.043271]  ? vmx_vcpu_load+0x27/0x40 [kvm_intel]
[  111.043286]  kvm_xen_vcpu_set_attr+0x2a/0x4f0 [kvm]
[  111.043349]  ? kvm_arch_vcpu_load+0x66/0x200 [kvm]
[  111.043405]  kvm_arch_vcpu_ioctl+0x817/0x12a0 [kvm]
[  111.043464]  ? trace_contention_end+0x2d/0xd0
[  111.043475]  kvm_vcpu_ioctl+0x519/0x700 [kvm]
[  111.043525]  ? do_user_addr_fault+0x1fa/0x6b0
[  111.043532]  ? do_user_addr_fault+0x1fa/0x6b0
[  111.043538]  __x64_sys_ioctl+0x8a/0xc0
[  111.043544]  do_syscall_64+0x3b/0x90
[  111.043549]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  111.043554] RIP: 0033:0x7f485cc3fd1b
[  111.043558] Code: 73 01 c3 48 8b 0d 05 a1 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 a0 1b 00 f7 d8 64 89 01 48
[  111.043561] RSP: 002b:00007ffe8726c018 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  111.043565] RAX: ffffffffffffffda RBX: 00007f485d027000 RCX: 00007f485cc3fd1b
[  111.043568] RDX: 00007ffe8726c160 RSI: 000000004048aecb RDI: 0000000000000007
[  111.043570] RBP: 00007f485cff26c0 R08: 00000000004188f0 R09: 0000000000000000
[  111.043573] R10: 0000000000000012 R11: 0000000000000246 R12: 0000000000000010
[  111.043575] R13: 000000002645a922 R14: 63bdc02500000002 R15: 00000000021082a0
[  111.043581]  </TASK>
David Woodhouse Jan. 11, 2023, 8:49 a.m. UTC | #16
On Tue, 2023-01-10 at 19:17 +0000, David Woodhouse wrote:
> On Tue, 2023-01-10 at 15:10 +0100, Paolo Bonzini wrote:
> > On 1/10/23 13:55, David Woodhouse wrote:
> > > > However, I
> > > > completely forgot the sev_lock_vcpus_for_migration case, which is the
> > > > exception that... well, disproves the rule.
> > > > 
> > > But because it's an exception and rarely happens in practice, lockdep
> > > didn't notice and keep me honest sooner? Can we take them in that order
> > > just for fun at startup, to make sure lockdep knows?
> > 
> > Sure, why not.  Out of curiosity, is this kind of "priming" a thing
> > elsewhere in the kernel
> 
> I did this:
> 
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -461,6 +461,11 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
>  static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  {
>         mutex_init(&vcpu->mutex);
> +
> +       /* Ensure that lockdep knows vcpu->mutex is taken *inside* kvm->lock */
> +       mutex_lock(&vcpu->mutex);
> +       mutex_unlock(&vcpu->mutex);
> +
>         vcpu->cpu = -1;
>         vcpu->kvm = kvm;
>         vcpu->vcpu_id = id;
> 
> 
> What I got when I ran xen_shinfo_test was... not what I expected:
> 
> 
> [13890.148203] ======================================================
> [13890.148205] WARNING: possible circular locking dependency detected
> [13890.148207] 6.1.0-rc4+ #1024 Tainted: G          I E     
> [13890.148209] ------------------------------------------------------
> [13890.148210] xen_shinfo_test/13326 is trying to acquire lock:
> [13890.148212] ffff888107d493b0 (&gpc->lock){....}-{2:2}, at: kvm_xen_update_runstate_guest+0xf2/0x4e0 [kvm]
> [13890.148285] 
>                but task is already holding lock:
> [13890.148287] ffff88887f671718 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x84/0x7c0
> [13890.148295] 
>                which lock already depends on the new lock.
> 

...

> [13890.148997] Chain exists of:
>                  &gpc->lock --> &p->pi_lock --> &rq->__lock


I believe that's because of the priority inheritance check which
happens when there's *contention* for gpc->lock.

But in the majority of cases, anything taking a write lock on that (and
thus causing contention) is going to be invalidating the cache *anyway*
and causing us to abort. Or revalidating it, I suppose. But either way
we're racing with seeing an invalid state. 

In which case we might as well just make it a trylock. Doing so *only*
for the scheduling-out atomic code path looks a bit like this... is
there a prettier way?

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 07e61cc9881e..c444948ab1ac 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -272,7 +272,12 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	 * Attempt to obtain the GPC lock on *both* (if there are two)
 	 * gfn_to_pfn caches that cover the region.
 	 */
-	read_lock_irqsave(&gpc1->lock, flags);
+	local_irq_save(flags);
+	if (!read_trylock(&gpc1->lock)) {
+		if (atomic)
+			return;
+		read_lock(&gpc1->lock);
+	}
 	while (!kvm_gpc_check(gpc1, user_len1)) {
 		read_unlock_irqrestore(&gpc1->lock, flags);
 
@@ -283,7 +288,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		if (kvm_gpc_refresh(gpc1, user_len1))
 			return;
 
-		read_lock_irqsave(&gpc1->lock, flags);
+		goto retry;
 	}
 
 	if (likely(!user_len2)) {
@@ -309,7 +314,13 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		 * gpc1 lock to make lockdep shut up about it.
 		 */
 		lock_set_subclass(&gpc1->lock.dep_map, 1, _THIS_IP_);
-		read_lock(&gpc2->lock);
+		if (!read_trylock(&gpc2->lock)) {
+			if (atomic) {
+				read_unlock_irqrestore(&gpc1->lock, flags);
+				return;
+			}
+			read_lock(&gpc2->lock);
+		}
 
 		if (!kvm_gpc_check(gpc2, user_len2)) {
 			read_unlock(&gpc2->lock);
Paolo Bonzini Jan. 11, 2023, 10:49 p.m. UTC | #17
On 1/11/23 09:49, David Woodhouse wrote:
>> [13890.148203] ======================================================
>> [13890.148205] WARNING: possible circular locking dependency detected
>> [13890.148207] 6.1.0-rc4+ #1024 Tainted: G          I E
>> [13890.148209] ------------------------------------------------------
>> [13890.148210] xen_shinfo_test/13326 is trying to acquire lock:
>> [13890.148212] ffff888107d493b0 (&gpc->lock){....}-{2:2}, at: kvm_xen_update_runstate_guest+0xf2/0x4e0 [kvm]
>> [13890.148285]
>>                 but task is already holding lock:
>> [13890.148287] ffff88887f671718 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x84/0x7c0
>> [13890.148295]
>>                 which lock already depends on the new lock.
>>
> 
> ...
> 
>> [13890.148997] Chain exists of:
>>                   &gpc->lock --> &p->pi_lock --> &rq->__lock
> 
> 
> I believe that's because of the priority inheritance check which
> happens when there's *contention* for gpc->lock.
> 
> But in the majority of cases, anything taking a write lock on that (and
> thus causing contention) is going to be invalidating the cache *anyway*
> and causing us to abort. Or revalidating it, I suppose. But either way
> we're racing with seeing an invalid state.
> 
> In which case we might as well just make it a trylock. Doing so *only*
> for the scheduling-out atomic code path looks a bit like this... is
> there a prettier way?

I'll add a prettier way (or at least encapsulate this one) when I send 
the patches to introduce lock+check functions for gpc.

Paolo

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index da4bbd043a7b..16c89f7e98c3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6460,7 +6460,7 @@  static int kvm_vm_ioctl_set_msr_filter(struct kvm *kvm,
 	struct kvm_x86_msr_filter *new_filter, *old_filter;
 	bool default_allow;
 	bool empty = true;
-	int r = 0;
+	int r;
 	u32 i;
 
 	if (filter->flags & ~KVM_MSR_FILTER_VALID_MASK)
@@ -6488,16 +6488,14 @@  static int kvm_vm_ioctl_set_msr_filter(struct kvm *kvm,
 	mutex_lock(&kvm->lock);
 
 	/* The per-VM filter is protected by kvm->lock... */
-	old_filter = srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1);
+	old_filter = rcu_replace_pointer(kvm->arch.msr_filter, new_filter, 1);
+	kvm_make_all_cpus_request(kvm, KVM_REQ_MSR_FILTER_CHANGED);
 
-	rcu_assign_pointer(kvm->arch.msr_filter, new_filter);
-	synchronize_srcu(&kvm->srcu);
+	mutex_unlock(&kvm->lock);
 
+	synchronize_srcu(&kvm->srcu);
 	kvm_free_msr_filter(old_filter);
 
-	kvm_make_all_cpus_request(kvm, KVM_REQ_MSR_FILTER_CHANGED);
-	mutex_unlock(&kvm->lock);
-
 	return 0;
 }