mbox series

[0/4] KVM: arm64: vgic: Locking fixes

Message ID 20230518100914.2837292-1-jean-philippe@linaro.org (mailing list archive)
Headers show
Series KVM: arm64: vgic: Locking fixes | expand

Message

Jean-Philippe Brucker May 18, 2023, 10:09 a.m. UTC
Another fun locking puzzle, between the new config_lock and srcu.
Patch 1 attempts to fix it, and the other patches fix simpler issues.

I got these lockdep reports while running KVM QEMU on a TCG QEMU, but it
can also be triggered by running the vgic_irq kselftest on TCG QEMU.
Now, with the fix and lockdep enabled, vgic_irq hangs but I believe it's
an unrelated weirdness: if I introduce a separate lockdep warning for
some made up locks, then the test passes again. So I'm sending this out
now for discussion, and will investigate that one later.

I run the host with:

qemu-system-aarch64 -M virt,virtualization=true,gic-version=3 -m 2G -smp 8 -cpu max -append 'console=hvc0 root=/dev/vda rw' -kernel Image -nodefaults -chardev stdio,mux=on,id=virtiocon0,signal=off -device virtio-serial-device -device virtconsole,chardev=virtiocon0 -mon chardev=virtiocon0,mode=readline -device virtio-blk-device,drive=hd0 -drive format=raw,if=none,file=virt_root.bin,id=hd0 -nographic

Jean-Philippe Brucker (4):
  KVM: arm64: vgic: Fix a circular locking issue
  KVM: arm64: vgic: Wrap vgic_its_create() with config_lock
  KVM: arm64: vgic: Fix locking comment
  KVM: arm64: vgic: Fix a comment

 arch/arm64/kvm/vgic/vgic-init.c       | 27 +++++++++++++++++------
 arch/arm64/kvm/vgic/vgic-its.c        | 14 ++++++++----
 arch/arm64/kvm/vgic/vgic-kvm-device.c | 10 +++++++--
 arch/arm64/kvm/vgic/vgic-mmio-v3.c    | 31 ++++++++++++++++++---------
 arch/arm64/kvm/vgic/vgic-mmio.c       |  8 ++-----
 arch/arm64/kvm/vgic/vgic-v2.c         |  6 ------
 arch/arm64/kvm/vgic/vgic-v3.c         |  7 ------
 arch/arm64/kvm/vgic/vgic-v4.c         |  3 ++-
 8 files changed, 64 insertions(+), 42 deletions(-)

Comments

Oliver Upton May 18, 2023, 6:23 p.m. UTC | #1
On Thu, May 18, 2023 at 11:09:14AM +0100, Jean-Philippe Brucker wrote:
> Another fun locking puzzle, between the new config_lock and srcu.
> Patch 1 attempts to fix it, and the other patches fix simpler issues.
> 
> I got these lockdep reports while running KVM QEMU on a TCG QEMU, but it
> can also be triggered by running the vgic_irq kselftest on TCG QEMU.
> Now, with the fix and lockdep enabled, vgic_irq hangs but I believe it's
> an unrelated weirdness: if I introduce a separate lockdep warning for
> some made up locks, then the test passes again. So I'm sending this out
> now for discussion, and will investigate that one later.
> 
> I run the host with:
> 
> qemu-system-aarch64 -M virt,virtualization=true,gic-version=3 -m 2G -smp 8 -cpu max -append 'console=hvc0 root=/dev/vda rw' -kernel Image -nodefaults -chardev stdio,mux=on,id=virtiocon0,signal=off -device virtio-serial-device -device virtconsole,chardev=virtiocon0 -mon chardev=virtiocon0,mode=readline -device virtio-blk-device,drive=hd0 -drive format=raw,if=none,file=virt_root.bin,id=hd0 -nographic
> 
> Jean-Philippe Brucker (4):
>   KVM: arm64: vgic: Fix a circular locking issue
>   KVM: arm64: vgic: Wrap vgic_its_create() with config_lock
>   KVM: arm64: vgic: Fix locking comment
>   KVM: arm64: vgic: Fix a comment

Damn! Thought I had this sorted.

Thanks for taking a crack at this. I've looked it over a couple times,
and overall looks good.

For the series:

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>

Once I get some free cycles this afternoon/evening I'll take the patches
for a spin.
Marc Zyngier May 19, 2023, 8:46 a.m. UTC | #2
On Thu, 18 May 2023 11:09:14 +0100,
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
> 
> Another fun locking puzzle, between the new config_lock and srcu.
> Patch 1 attempts to fix it, and the other patches fix simpler issues.

Thanks for that and for your excellent description of the problems.

> I got these lockdep reports while running KVM QEMU on a TCG QEMU, but it
> can also be triggered by running the vgic_irq kselftest on TCG QEMU.
> Now, with the fix and lockdep enabled, vgic_irq hangs but I believe it's
> an unrelated weirdness: if I introduce a separate lockdep warning for
> some made up locks, then the test passes again. So I'm sending this out
> now for discussion, and will investigate that one later.

I've taken these patches for a spin, and I cannot reproduce this hang,
though I'm running on actual HW and not QEMU. It would be really
annoying if lockdep actively introduced issues... :-/

Any chance you could dig into this as you have a good reproducer? I'll
try to setup a TGC environment on my end as well.

Thanks,

	M.
Jean-Philippe Brucker May 19, 2023, 1:22 p.m. UTC | #3
On Fri, May 19, 2023 at 09:46:45AM +0100, Marc Zyngier wrote:
> On Thu, 18 May 2023 11:09:14 +0100,
> Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
> > 
> > Another fun locking puzzle, between the new config_lock and srcu.
> > Patch 1 attempts to fix it, and the other patches fix simpler issues.
> 
> Thanks for that and for your excellent description of the problems.
> 
> > I got these lockdep reports while running KVM QEMU on a TCG QEMU, but it
> > can also be triggered by running the vgic_irq kselftest on TCG QEMU.
> > Now, with the fix and lockdep enabled, vgic_irq hangs but I believe it's
> > an unrelated weirdness: if I introduce a separate lockdep warning for
> > some made up locks, then the test passes again. So I'm sending this out
> > now for discussion, and will investigate that one later.
> 
> I've taken these patches for a spin, and I cannot reproduce this hang,
> though I'm running on actual HW and not QEMU. It would be really
> annoying if lockdep actively introduced issues... :-/
> 
> Any chance you could dig into this as you have a good reproducer? I'll
> try to setup a TGC environment on my end as well.

Sure, I'll investigate this. All I know so far is that it is stuck
somewhere in the first test_inject_preemption()

Thanks,
Jean
Jean-Philippe Brucker May 24, 2023, 12:40 p.m. UTC | #4
On Fri, May 19, 2023 at 09:46:45AM +0100, Marc Zyngier wrote:
> On Thu, 18 May 2023 11:09:14 +0100,
> Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
> > 
> > Another fun locking puzzle, between the new config_lock and srcu.
> > Patch 1 attempts to fix it, and the other patches fix simpler issues.
> 
> Thanks for that and for your excellent description of the problems.
> 
> > I got these lockdep reports while running KVM QEMU on a TCG QEMU, but it
> > can also be triggered by running the vgic_irq kselftest on TCG QEMU.
> > Now, with the fix and lockdep enabled, vgic_irq hangs but I believe it's
> > an unrelated weirdness: if I introduce a separate lockdep warning for
> > some made up locks, then the test passes again. So I'm sending this out
> > now for discussion, and will investigate that one later.
> 
> I've taken these patches for a spin, and I cannot reproduce this hang,
> though I'm running on actual HW and not QEMU. It would be really
> annoying if lockdep actively introduced issues... :-/
> 
> Any chance you could dig into this as you have a good reproducer? I'll
> try to setup a TGC environment on my end as well.

I'm not sure this is a fixable problem, didn't find anything obviously
wrong. It's just that when running with lockdep and KASAN (forgot about
that one, sorry) on an emulated platform, some tests can't make progress.
In this config I can boot a guest without problem, but this particular
test gets stuck when trying to inject 32 interrupts. What I see is:

1. KVM prepares to enter the guest, spends a lot of time dealing with the
   vGIC with interrupts disabled.
2. Just before entering the guest it sees ISR_EL1 set (the timer interrupt
   is pending), so cancels the return to guest, reprograms the timer and
   goto 1.

For this particular test, trying to inject several interrupts
simultaneously, I think kvm_vgic_flush_hwstate() needs to take and release
lots of locks which takes forever with lockdep. I measure about 4ms inside
that function, which corresponds to the timer period at HZ_250.

Previously, I guess getting a lockdep warning would disable lockdep and
allow the test to make progress. 

So maybe the vGIC could still be optimized, or maybe this isn't worth
fixing, we could just say that this setup is too slow to reliably run KVM
and leave it at that.

Thanks,
Jean
Marc Zyngier May 24, 2023, 12:49 p.m. UTC | #5
On Thu, 18 May 2023 11:09:14 +0100, Jean-Philippe Brucker wrote:
> Another fun locking puzzle, between the new config_lock and srcu.
> Patch 1 attempts to fix it, and the other patches fix simpler issues.
> 
> I got these lockdep reports while running KVM QEMU on a TCG QEMU, but it
> can also be triggered by running the vgic_irq kselftest on TCG QEMU.
> Now, with the fix and lockdep enabled, vgic_irq hangs but I believe it's
> an unrelated weirdness: if I introduce a separate lockdep warning for
> some made up locks, then the test passes again. So I'm sending this out
> now for discussion, and will investigate that one later.
> 
> [...]

Applied to fixes, thanks!

[1/4] KVM: arm64: vgic: Fix a circular locking issue
      commit: 59112e9c390be595224e427827475a6cd3726021
[2/4] KVM: arm64: vgic: Wrap vgic_its_create() with config_lock
      commit: 9cf2f840c439b6b23bd99f584f2917ca425ae406
[3/4] KVM: arm64: vgic: Fix locking comment
      commit: c38b8400aef99d63be2b1ff131bb993465dcafe1
[4/4] KVM: arm64: vgic: Fix a comment
      commit: 62548732260976ca88fcb17ef98ab661e7ce7504

Cheers,

	M.