KVM: arm/arm64: BUG: Fix losing level-sensitive interrupts

Message ID 1440571563-7004-1-git-send-email-p.fedin@samsung.com (mailing list archive)
State New, archived

Commit Message

Pavel Fedin Aug. 26, 2015, 6:46 a.m. UTC
Commit 71760950bf3dc796e5e53ea3300dec724a09f593
("arm/arm64: KVM: add a common vgic_queue_irq_to_lr fn") introduced
vgic_queue_irq_to_lr() function which checks vgic_dist_irq_is_pending()
before setting LR_STATE_PENDING bit. However, in some cases, the following
race condition is possible:
1. Userland injects an IRQ with level == 1, this ends up in
   vgic_update_irq_pending(), which in turn calls
   vgic_dist_irq_set_pending() for this IRQ.
2. vCPU gets kicked. But kernel does not manage to reschedule it quickly
   (!!!)
3. Userland quickly resets the IRQ to level == 0. vgic_update_irq_pending()
   in this case will call vgic_dist_irq_clear_pending() and reset the
   pending flag.
4. vCPU finally wakes up. It successfully rolls through
   __kvm_vgic_flush_hwstate(), which populates the vGIC registers. Before the
   aforementioned commit, the LR_STATE_PENDING bit was set unconditionally, and
   nothing bad happened. However, now vgic_queue_irq_to_lr() does not set
   any state bits on this LR at all, because vgic_dist_irq_is_pending()
   returns zero (it was reset in step 3). Since this is a level-sensitive
   IRQ, we end up with an LR containing only the LR_EOI_INT bit. The guest
   will not get this interrupt.

This patch fixes the problem by restoring the unconditional setting of the
LR_STATE_PENDING bit.
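
For illustration, steps 1-4 above can be replayed in a small standalone C
sketch. This is a toy model, not kernel code: struct toy_irq, the toy_*
functions and the TOY_* bit values are hypothetical names made up for this
example, and the model assumes (as described in step 4) that the flush path
still visits the IRQ after the level has dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values, chosen for this sketch only. */
#define TOY_LR_STATE_PENDING (1u << 0)
#define TOY_LR_STATE_ACTIVE  (1u << 1)
#define TOY_LR_EOI_INT       (1u << 2)

/* Toy stand-in for the distributor state of one level-sensitive IRQ. */
struct toy_irq {
	bool pending;	/* stands in for vgic_dist_irq_is_pending() */
	bool active;	/* stands in for vgic_irq_is_active() */
	bool kicked;	/* the vCPU was kicked and will flush this IRQ */
};

/* Stands in for vgic_update_irq_pending(): userland changes the level. */
void toy_set_level(struct toy_irq *irq, bool level)
{
	irq->pending = level;
	if (level)
		irq->kicked = true;	/* the flush happens later */
}

/*
 * Stands in for vgic_queue_irq_to_lr(). With check_pending == true the
 * pending bit is set only if the distributor still reports the IRQ as
 * pending; with check_pending == false it is set unconditionally.
 */
uint32_t toy_queue_irq_to_lr(const struct toy_irq *irq, bool check_pending)
{
	uint32_t state = 0;

	if (irq->active)
		state |= TOY_LR_STATE_ACTIVE;
	else if (!check_pending || irq->pending)
		state |= TOY_LR_STATE_PENDING;

	/* Level-sensitive IRQ: ask for maintenance on EOI. */
	state |= TOY_LR_EOI_INT;
	return state;
}

/* Replays steps 1-4 and returns the resulting LR state bits. */
uint32_t toy_race(bool check_pending)
{
	struct toy_irq irq = { false, false, false };

	toy_set_level(&irq, true);	/* step 1: level == 1 */
					/* step 2: vCPU not yet rescheduled */
	toy_set_level(&irq, false);	/* step 3: level == 0 */
	assert(irq.kicked);		/* step 4: flush still runs */
	return toy_queue_irq_to_lr(&irq, check_pending);
}
```

In this model toy_race(true) returns only TOY_LR_EOI_INT, while
toy_race(false) returns TOY_LR_STATE_PENDING | TOY_LR_EOI_INT.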

The bug was caught on a Cavium ThunderX machine, kernel v4.1.6, running a
qemu "virt" guest, where it affected the pl011 driver.

Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
---
 virt/kvm/arm/vgic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Marc Zyngier Aug. 26, 2015, 8:27 a.m. UTC | #1
On Wed, 26 Aug 2015 09:46:03 +0300
Pavel Fedin <p.fedin@samsung.com> wrote:

Hi Pavel,

> Commit 71760950bf3dc796e5e53ea3300dec724a09f593
> ("arm/arm64: KVM: add a common vgic_queue_irq_to_lr fn") introduced
> vgic_queue_irq_to_lr() function which checks vgic_dist_irq_is_pending()
> before setting LR_STATE_PENDING bit. However, in some cases, the following
> race condition is possible:
> 1. Userland injects an IRQ with level == 1, this ends up in
>    vgic_update_irq_pending(), which in turn calls
>    vgic_dist_irq_set_pending() for this IRQ.
> 2. vCPU gets kicked. But kernel does not manage to reschedule it quickly
>    (!!!)
> 3. Userland quickly resets the IRQ to level == 0. vgic_update_irq_pending()
>    in this case will call vgic_dist_irq_clear_pending() and reset the
>    pending flag.

So userspace drops the line to 0 *before* the guest had a chance to do
anything? Well, this is not the expected behaviour for a level
triggered interrupt, which should look like this:

- device raises the interrupt line
- guest takes the interrupt
- guest pokes the device to clear the interrupt condition
- device lowers the line

The behaviour you describe is that of an edge triggered interrupt, and
it is not surprising at all that you lose interrupts.
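
The difference between the two trigger modes can be sketched in a few lines
of standalone C. This is only a toy model under simple assumptions: a
level-sensitive input is visible only while the line is high, whereas an
edge-triggered input latches the rising edge until it is taken. The names
struct line_model and line_* are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct line_model {
	bool level;		/* current level of the interrupt line */
	bool edge_latch;	/* set on a rising edge, cleared when taken */
};

/* The device (or userspace emulating it) drives the line. */
void line_set(struct line_model *l, bool level)
{
	assert(l != NULL);
	if (level && !l->level)
		l->edge_latch = true;	/* remember the rising edge */
	l->level = level;
}

/* Level-sensitive view: the interrupt exists only while the line is high. */
bool line_sample_level(const struct line_model *l)
{
	assert(l != NULL);
	return l->level;
}

/* Edge-triggered view: a latched edge is consumed exactly once. */
bool line_take_edge(struct line_model *l)
{
	bool hit;

	assert(l != NULL);
	hit = l->edge_latch;
	l->edge_latch = false;
	return hit;
}
```

If the line is raised and lowered again before anyone samples it,
line_sample_level() reports nothing while line_take_edge() still reports the
pulse once.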

This really feels like a userspace bug to me (I vaguely remember some
QEMU issues regarding this a while ago, but my memory is a bit hazy).
Christoffer?

	M.
Pavel Fedin Aug. 26, 2015, 10:58 a.m. UTC | #2
Hello!

> So userspace drops the line to 0 *before* the guest had a chance to do
> anything? Well, this is not the expected behaviour for a level
> triggered interrupt

 I know. But, still...
 Imagine that we have misconfigured the HW for some reason. The device pulses an IRQ line, but we
think it's a level IRQ. What will happen on real hardware? Not much, the interrupt will still be
sampled.
 So, to model the hardware better, shouldn't we improve KVM's behavior here? Especially since
before v4.1 it did not actually have this problem.

> This really feels like a userspace bug to me (I vaguely remember some
> QEMU issues regarding this a while ago, but my memory is a bit hazy).

 You know, maybe it really is qemu's problem. To tell the truth, I have been too lazy to read the
whole PL011 spec, but qemu appears to pulse the line without any PL011 interrupt servicing at all.
I know this because my kernel is patched: it uses software emulation of the vCPU interface, because
the vGIC is broken on ThunderX, and LR state changes and all the maintenance are done upon an EOIR
write (which is trapped). With this change the consequences of losing an interrupt are much more
severe: the IRQ line gets stuck and stops working at all. Subsequent injections are blocked by
vgic_can_sample_irq(), which returns false because vgic_irq_is_queued() returns true. That is
because vgic_irq_clear_queued() is called during the maintenance procedure, which in this case
never happens, because the interrupt is never EOIed, because it was never made PENDING in the LR.
Actually, that's how I found this.
 So, here is why I am describing these unrelated things: with IRQ line processing completely
locked up, line switches between 1 and 0 are still injected (vgic_update_irq_pending() is called
with both values; I added some debug output in order to see this). The guest successfully boots up
to a login prompt and everything is fine, except that I cannot type anything on the console because
the serial port's interrupt is locked up. I suppose that this pulsing has to do with the output
FIFO. Could this be some bug in the kernel's pl011 driver itself, which does something wrong and
does not handle interrupts properly during output?
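
The lock-up chain described above (never PENDING in the LR, so never EOIed,
so never un-queued, so never sampled again) can be sketched as a toy model in
standalone C. It assumes the setup from this message, where maintenance runs
only on a trapped EOIR write, and it covers level-sensitive IRQs only. The
names struct lockup_irq and lockup_* are hypothetical, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct lockup_irq {
	bool queued;		/* stands in for vgic_irq_is_queued() */
	bool lr_pending;	/* the LR carries a pending state bit */
};

/* Stands in for vgic_can_sample_irq() on a level-sensitive IRQ. */
bool lockup_can_sample(const struct lockup_irq *irq)
{
	assert(irq != NULL);
	return !irq->queued;
}

/* The flush path queues the IRQ; the LR state follows the distributor. */
void lockup_queue(struct lockup_irq *irq, bool dist_pending)
{
	assert(irq != NULL);
	irq->queued = true;
	irq->lr_pending = dist_pending;
}

/* The guest runs; it can only EOI an interrupt it actually received. */
void lockup_guest_run(struct lockup_irq *irq)
{
	assert(irq != NULL);
	if (!irq->lr_pending)
		return;		/* nothing delivered, so no EOIR write */
	irq->lr_pending = false;
	irq->queued = false;	/* stands in for vgic_irq_clear_queued() */
}
```

Calling lockup_queue() with dist_pending == false followed by
lockup_guest_run() leaves lockup_can_sample() returning false forever, which
is the stuck line; with dist_pending == true the IRQ can be sampled again
after the guest has run.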

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marc Zyngier Aug. 26, 2015, 11:13 a.m. UTC | #3
On 26/08/15 11:58, Pavel Fedin wrote:
>  Hello!
> 
>> So userspace drops the line to 0 *before* the guest had a chance to do
>> anything? Well, this is not the expected behaviour for a level
>> triggered interrupt
> 
>  I know. But, still...
>  Imagine that we have misconfigured the HW for some reason. The device pulses an IRQ line, but we
> think it's a level IRQ. What will happen on real hardware? Not much, the interrupt will still be
> sampled.
>  So, to model the hardware better, shouldn't we improve KVM's behavior here? Especially since
> before v4.1 it did not actually have this problem.

I'm sorry, but that's actually a very accurate model of the HW. You
misconfigure the line trigger, you lose interrupts. This happens on
real HW all the time. And if you haven't seen that before, you haven't
tried very hard.

As for v4.1 not having that problem, the pl011 driver has gone through a
lot of rework lately, and I wouldn't be surprised if it now exhibited a
different behaviour thanks to the broken userspace behaviour.

> 
>> This really feels like a userspace bug to me (I vaguely remember some
>> QEMU issues regarding this a while ago, but my memory is a bit hazy).
> 
>  You know, maybe it really is qemu's problem. To tell the truth, I have been too lazy to read the
> whole PL011 spec, but qemu appears to pulse the line without any PL011 interrupt servicing at all.
> I know this because my kernel is patched: it uses software emulation of the vCPU interface, because
> the vGIC is broken on ThunderX, and LR state changes and all the maintenance are done upon an EOIR
> write (which is trapped). With this change the consequences of losing an interrupt are much more
> severe: the IRQ line gets stuck and stops working at all. Subsequent injections are blocked by
> vgic_can_sample_irq(), which returns false because vgic_irq_is_queued() returns true. That is
> because vgic_irq_clear_queued() is called during the maintenance procedure, which in this case
> never happens, because the interrupt is never EOIed, because it was never made PENDING in the LR.
> Actually, that's how I found this.

TL;DR.

You're using a different code base, broken HW, and what is apparently a
buggy userspace. Sorry, but I don't really want to introduce another bug
in the VGIC code (we have too many already). And what you're suggesting
is to actually introduce a bug.

Thanks,

	M.
Pavel Fedin Aug. 26, 2015, 11:33 a.m. UTC | #4
Hello!

> As for v4.1 not having that problem, the pl011 driver has gone through a
> lot of rework lately, and I wouldn't be surprised if it now exhibited a
> different behaviour thanks to the broken userspace behaviour.

 Sorry, you misunderstood me. Or I wrote badly. I meant that _KVM_ did not have this particular
problem in kernel v4.0, because:
http://lxr.free-electrons.com/source/virt/kvm/arm/vgic.c?v=4.0#L998
 As you can see, LR_STATE_PENDING is assigned unconditionally. Is this code correct? I believe it
is. Compare with:
http://lxr.free-electrons.com/source/virt/kvm/arm/vgic.c#L1104
 Now it is possible to queue an IRQ that is neither PENDING nor ACTIVE. Does that even make sense?
So what is wrong with the following modification?
--- cut ---
         if (vgic_irq_is_active(vcpu, irq)) {
                 vlr.state |= LR_STATE_ACTIVE;
                 kvm_debug("Set active, clear distributor: 0x%x\n", vlr.state);
                 vgic_irq_clear_active(vcpu, irq);
                 vgic_update_state(vcpu->kvm);
         } else {
                 vlr.state |= LR_STATE_PENDING;
                 kvm_debug("Set pending: 0x%x\n", vlr.state);
         }
--- cut ---
 Alex, are you reading this? Can you explain why you introduced that extra check?

> And what you're suggesting is to actually introduce a bug.

 Why would that be a bug, if it was not a bug in kernel 4.0?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


Pavel Fedin Aug. 26, 2015, 1:11 p.m. UTC | #5
Hello!

> Sorry, but I don't really want to introduce another bug
> in the VGIC code (we have too many already). And what you're suggesting
> is to actually introduce a bug.

 Another, alternative idea...
 So far, we have a situation where an empty LR, containing only the LR_EOI_INT bit, is queued. Can
we say that this is wrong?
 If you agree, maybe we should do something else instead? Maybe we should cancel such "ghost"
interrupts early, avoiding immediate and completely unnecessary maintenance interrupts upon guest
entry?
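
One possible reading of this idea, sketched as standalone C rather than as a
kernel patch: refuse to program an LR for an IRQ that is neither active nor
pending, and drop its queued flag so that the line can be sampled again.
Everything here is hypothetical and for illustration only, including struct
ghost_irq, ghost_try_queue() and the GHOST_* bit values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values, chosen for this sketch only. */
#define GHOST_LR_STATE_PENDING (1u << 0)
#define GHOST_LR_STATE_ACTIVE  (1u << 1)
#define GHOST_LR_EOI_INT       (1u << 2)

struct ghost_irq {
	bool pending;	/* distributor reports the IRQ as pending */
	bool active;	/* distributor reports the IRQ as active */
	bool queued;	/* IRQ is considered queued to the vCPU */
};

/*
 * Returns true and fills *lr_state if an LR should be programmed.
 * Returns false, leaving *lr_state untouched, if the IRQ is a "ghost"
 * with no state to deliver; in that case the queued flag is cleared.
 */
bool ghost_try_queue(struct ghost_irq *irq, uint32_t *lr_state)
{
	assert(irq != NULL && lr_state != NULL);

	if (!irq->active && !irq->pending) {
		irq->queued = false;	/* cancel early, no empty LR */
		return false;
	}

	*lr_state = GHOST_LR_EOI_INT |
		    (irq->active ? GHOST_LR_STATE_ACTIVE
				 : GHOST_LR_STATE_PENDING);
	irq->queued = true;
	return true;
}
```

With this shape an LR is never written with only the EOI bit set, so the
guest is not entered just to take a maintenance interrupt for nothing.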

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


Christoffer Dall Aug. 26, 2015, 2:03 p.m. UTC | #6
On Wed, Aug 26, 2015 at 09:27:21AM +0100, Marc Zyngier wrote:
> On Wed, 26 Aug 2015 09:46:03 +0300
> Pavel Fedin <p.fedin@samsung.com> wrote:
> 
> Hi Pavel,
> 
> > Commit 71760950bf3dc796e5e53ea3300dec724a09f593
> > ("arm/arm64: KVM: add a common vgic_queue_irq_to_lr fn") introduced
> > vgic_queue_irq_to_lr() function which checks vgic_dist_irq_is_pending()
> > before setting LR_STATE_PENDING bit. However, in some cases, the following
> > race condition is possible:
> > 1. Userland injects an IRQ with level == 1, this ends up in
> >    vgic_update_irq_pending(), which in turn calls
> >    vgic_dist_irq_set_pending() for this IRQ.
> > 2. vCPU gets kicked. But kernel does not manage to reschedule it quickly
> >    (!!!)
> > 3. Userland quickly resets the IRQ to level == 0. vgic_update_irq_pending()
> >    in this case will call vgic_dist_irq_clear_pending() and reset the
> >    pending flag.
> 
> So userspace drops the line to 0 *before* the guest had a chance to do
> anything? Well, this is not the expected behaviour for a level
> triggered interrupt, which should look like this:
> 
> - device raises the interrupt line
> - guest takes the interrupt
> - guest pokes the device to clear the interrupt condition
> - device lowers the line
> 
> The behaviour you describe is that of an edge triggered interrupt, and
> it is not surprising at all that you lose interrupts.
> 
> This really feels like a userspace bug to me (I vaguely remember some
> QEMU issues regarding this a while ago, but my memory is a bit hazy).
> Christoffer?
> 
I think it's perfectly valid for userspace to raise and lower a level
triggered interrupt at will for some device emulation.

But it is inconsistent to get to a point in the vgic code where we try
to queue something which is neither active nor pending.  See my reply to
the original patch.

-Christoffer

Patch

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index fdcad86..90d1671 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1111,7 +1111,7 @@  static void vgic_queue_irq_to_lr(struct kvm_vcpu *vcpu, int irq,
 		kvm_debug("Set active, clear distributor: 0x%x\n", vlr.state);
 		vgic_irq_clear_active(vcpu, irq);
 		vgic_update_state(vcpu->kvm);
-	} else if (vgic_dist_irq_is_pending(vcpu, irq)) {
+	} else {
 		vlr.state |= LR_STATE_PENDING;
 		kvm_debug("Set pending: 0x%x\n", vlr.state);
 	}