Message ID | 20230518100914.2837292-1-jean-philippe@linaro.org (mailing list archive) |
---|---|
Headers | show |
Series | KVM: arm64: vgic: Locking fixes | expand |
On Thu, May 18, 2023 at 11:09:14AM +0100, Jean-Philippe Brucker wrote: > Another fun locking puzzle, between the new config_lock and srcu. > Patch 1 attempts to fix it, and the other patches fix simpler issues. > > I got these lockdep reports while running KVM QEMU on a TCG QEMU, but it > can also be triggered by running the vgic_irq kselftest on TCG QEMU. > Now, with the fix and lockdep enabled, vgic_irq hangs but I believe it's > an unrelated weirdness: if I introduce a separate lockdep warning for > some made up locks, then the test passes again. So I'm sending this out > now for discussion, and will investigate that one later. > > I run the host with: > > qemu-system-aarch64 -M virt,virtualization=true,gic-version=3 -m 2G -smp 8 -cpu max -append 'console=hvc0 root=/dev/vda rw' -kernel Image -nodefaults -chardev stdio,mux=on,id=virtiocon0,signal=off -device virtio-serial-device -device virtconsole,chardev=virtiocon0 -mon chardev=virtiocon0,mode=readline -device virtio-blk-device,drive=hd0 -drive format=raw,if=none,file=virt_root.bin,id=hd0 -nographic > > Jean-Philippe Brucker (4): > KVM: arm64: vgic: Fix a circular locking issue > KVM: arm64: vgic: Wrap vgic_its_create() with config_lock > KVM: arm64: vgic: Fix locking comment > KVM: arm64: vgic: Fix a comment Damn! Thought I had this sorted. Thanks for taking a crack at this. I've looked it over a couple times, and overall looks good. For the series: Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Once I get some free cycles this afternoon/evening I'll take the patches for a spin.
On Thu, 18 May 2023 11:09:14 +0100, Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > > Another fun locking puzzle, between the new config_lock and srcu. > Patch 1 attempts to fix it, and the other patches fix simpler issues. Thanks for that and for your excellent description of the problems. > I got these lockdep reports while running KVM QEMU on a TCG QEMU, but it > can also be triggered by running the vgic_irq kselftest on TCG QEMU. > Now, with the fix and lockdep enabled, vgic_irq hangs but I believe it's > an unrelated weirdness: if I introduce a separate lockdep warning for > some made up locks, then the test passes again. So I'm sending this out > now for discussion, and will investigate that one later. I've taken these patches for a spin, and I cannot reproduce this hang, though I'm running on actual HW and not QEMU. It would be really annoying if lockdep actively introduced issues... :-/ Any chance you could dig into this as you have a good reproducer? I'll try to setup a TGC environment on my end as well. Thanks, M.
On Fri, May 19, 2023 at 09:46:45AM +0100, Marc Zyngier wrote: > On Thu, 18 May 2023 11:09:14 +0100, > Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > > > > Another fun locking puzzle, between the new config_lock and srcu. > > Patch 1 attempts to fix it, and the other patches fix simpler issues. > > Thanks for that and for your excellent description of the problems. > > > I got these lockdep reports while running KVM QEMU on a TCG QEMU, but it > > can also be triggered by running the vgic_irq kselftest on TCG QEMU. > > Now, with the fix and lockdep enabled, vgic_irq hangs but I believe it's > > an unrelated weirdness: if I introduce a separate lockdep warning for > > some made up locks, then the test passes again. So I'm sending this out > > now for discussion, and will investigate that one later. > > I've taken these patches for a spin, and I cannot reproduce this hang, > though I'm running on actual HW and not QEMU. It would be really > annoying if lockdep actively introduced issues... :-/ > > Any chance you could dig into this as you have a good reproducer? I'll > try to setup a TGC environment on my end as well. Sure, I'll investigate this. All I know so far is that it is stuck somewhere in the first test_inject_preemption() Thanks, Jean
On Fri, May 19, 2023 at 09:46:45AM +0100, Marc Zyngier wrote: > On Thu, 18 May 2023 11:09:14 +0100, > Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > > > > Another fun locking puzzle, between the new config_lock and srcu. > > Patch 1 attempts to fix it, and the other patches fix simpler issues. > > Thanks for that and for your excellent description of the problems. > > > I got these lockdep reports while running KVM QEMU on a TCG QEMU, but it > > can also be triggered by running the vgic_irq kselftest on TCG QEMU. > > Now, with the fix and lockdep enabled, vgic_irq hangs but I believe it's > > an unrelated weirdness: if I introduce a separate lockdep warning for > > some made up locks, then the test passes again. So I'm sending this out > > now for discussion, and will investigate that one later. > > I've taken these patches for a spin, and I cannot reproduce this hang, > though I'm running on actual HW and not QEMU. It would be really > annoying if lockdep actively introduced issues... :-/ > > Any chance you could dig into this as you have a good reproducer? I'll > try to setup a TGC environment on my end as well. I'm not sure this is a fixable problem, didn't find anything obviously wrong. It's just that when running with lockdep and KASAN (forgot about that one, sorry) on an emulated platform, some tests can't make progress. In this config I can boot a guest without problem, but this particular test gets stuck when trying to inject 32 interrupts. What I see is: 1. KVM prepares to enter the guest, spends a lot of time dealing with the vGIC with interrupts disabled. 2. Just before entering the guest it sees ISR_EL1 set (the timer interrupt is pending), so cancels the return to guest, reprograms the timer and goto 1. For this particular test, trying to inject several interrupts simultaneously, I think kvm_vgic_flush_hwstate() needs to take and release lots of locks which takes forever with lockdep. I measure about 4ms inside that function, which corresponds to the timer period at HZ_250. Previously, I guess getting a lockdep warning would disable lockdep and allow the test to make progress. So maybe the vGIC could still be optimized, or maybe this isn't worth fixing, we could just say that this setup is too slow to reliably run KVM and leave it at that. Thanks, Jean
On Thu, 18 May 2023 11:09:14 +0100, Jean-Philippe Brucker wrote: > Another fun locking puzzle, between the new config_lock and srcu. > Patch 1 attempts to fix it, and the other patches fix simpler issues. > > I got these lockdep reports while running KVM QEMU on a TCG QEMU, but it > can also be triggered by running the vgic_irq kselftest on TCG QEMU. > Now, with the fix and lockdep enabled, vgic_irq hangs but I believe it's > an unrelated weirdness: if I introduce a separate lockdep warning for > some made up locks, then the test passes again. So I'm sending this out > now for discussion, and will investigate that one later. > > [...] Applied to fixes, thanks! [1/4] KVM: arm64: vgic: Fix a circular locking issue commit: 59112e9c390be595224e427827475a6cd3726021 [2/4] KVM: arm64: vgic: Wrap vgic_its_create() with config_lock commit: 9cf2f840c439b6b23bd99f584f2917ca425ae406 [3/4] KVM: arm64: vgic: Fix locking comment commit: c38b8400aef99d63be2b1ff131bb993465dcafe1 [4/4] KVM: arm64: vgic: Fix a comment commit: 62548732260976ca88fcb17ef98ab661e7ce7504 Cheers, M.