IRQ affinity not working on Xen pci-platform device

Message ID	07866eaf6354dd43d87cffb6eebf101716845b66.camel@infradead.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xenproject.org> Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> Message-ID: <07866eaf6354dd43d87cffb6eebf101716845b66.camel@infradead.org> Subject: IRQ affinity not working on Xen pci-platform device From: David Woodhouse <dwmw2@infradead.org> To: Thomas Gleixner <tglx@linutronix.de>, linux-kernel <linux-kernel@vger.kernel.org>, xen-devel <xen-devel@lists.xenproject.org> Date: Fri, 03 Mar 2023 15:16:11 +0000 Content-Type: multipart/signed; micalg="sha-256"; protocol="application/pkcs7-signature"; boundary="=-180Js8B6IU80WeBwXwfQ" User-Agent: Evolution 3.44.4-0ubuntu1 MIME-Version: 1.0
Series	IRQ affinity not working on Xen pci-platform device \| expand IRQ affinity not working on Xen pci-platform device

David Woodhouse March 3, 2023, 3:16 p.m. UTC

I added the 'xen_no_vector_callback' kernel parameter a while back
(commit b36b0fe96af) to ensure we could test that more for Linux
guests.

Most of my testing at the time was done with just two CPUs, and I
happened to just test it with four. It fails, because the IRQ isn't
actually affine to CPU0.

I tried making it work anyway (in line with the comment in platform-
pci.c which says that it shouldn't matter if it *runs* on CPU0 as long
as it processes events *for* CPU0). That didn't seem to work.

If I put the irq_set_affinity() call *before* the request_irq() that
does actually work. But it's setting affinity on an IRQ it doesn't even
own yet.

Test hacks below; this is testable with today's QEMU master (yay!) and:

  qemu-system-x86_64 -display none -serial mon:stdio -smp 4 \
     -accel kvm,xen-version=0x4000a,kernel-irqchip=split \
     -kernel ~/git/linux/arch/x86/boot//bzImage \
     -append "console=ttyS0,115200 xen_no_vector_callback"

...

[    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
[    0.578149] The affinity mask was 0-3
[    0.579081] The affinity mask is 0-3 and the handler is on 2
[    0.580288] The affinity mask is 0 and the handler is on 2

Thomas Gleixner March 3, 2023, 4:51 p.m. UTC | #1

David!

On Fri, Mar 03 2023 at 15:16, David Woodhouse wrote:
> I added the 'xen_no_vector_callback' kernel parameter a while back
> (commit b36b0fe96af) to ensure we could test that more for Linux
> guests.
>
> Most of my testing at the time was done with just two CPUs, and I
> happened to just test it with four. It fails, because the IRQ isn't
> actually affine to CPU0.
>
> I tried making it work anyway (in line with the comment in platform-
> pci.c which says that it shouldn't matter if it *runs* on CPU0 as long
> as it processes events *for* CPU0). That didn't seem to work.
>
> If I put the irq_set_affinity() call *before* the request_irq() that
> does actually work. But it's setting affinity on an IRQ it doesn't even
> own yet.

The core allows it for raisins. See below... :)

> Test hacks below; this is testable with today's QEMU master (yay!) and:
>
>   qemu-system-x86_64 -display none -serial mon:stdio -smp 4 \
>      -accel kvm,xen-version=0x4000a,kernel-irqchip=split \
>      -kernel ~/git/linux/arch/x86/boot//bzImage \
>      -append "console=ttyS0,115200 xen_no_vector_callback"
>
> ...
>
> [    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
> [    0.578149] The affinity mask was 0-3
> [    0.579081] The affinity mask is 0-3 and the handler is on 2
> [    0.580288] The affinity mask is 0 and the handler is on 2

What happens is that once the interrupt is requested, the affinity
setting is deferred to the first interrupt. See the marvelous dance in
arch/x86/kernel/apic/msi.c::msi_set_affinity().

If you do the setting before request_irq() then the startup will assign
it to the target mask right away.

Btw, you are using irq_get_affinity_mask(), which gives you the desired
target mask. irq_get_effective_affinity_mask() gives you the real one.

Can you verify that the thing moves over after the first interrupt or is
that too late already?

Thanks,

        tglx

David Woodhouse March 3, 2023, 4:54 p.m. UTC | #2

On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote:
> David!
> 
> On Fri, Mar 03 2023 at 15:16, David Woodhouse wrote:
> > I added the 'xen_no_vector_callback' kernel parameter a while back
> > (commit b36b0fe96af) to ensure we could test that more for Linux
> > guests.
> > 
> > Most of my testing at the time was done with just two CPUs, and I
> > happened to just test it with four. It fails, because the IRQ isn't
> > actually affine to CPU0.
> > 
> > I tried making it work anyway (in line with the comment in platform-
> > pci.c which says that it shouldn't matter if it *runs* on CPU0 as long
> > as it processes events *for* CPU0). That didn't seem to work.
> > 
> > If I put the irq_set_affinity() call *before* the request_irq() that
> > does actually work. But it's setting affinity on an IRQ it doesn't even
> > own yet.
> 
> The core allows it for raisins. See below... :)
> 
> > Test hacks below; this is testable with today's QEMU master (yay!) and:
> > 
> >   qemu-system-x86_64 -display none -serial mon:stdio -smp 4 \
> >      -accel kvm,xen-version=0x4000a,kernel-irqchip=split \
> >      -kernel ~/git/linux/arch/x86/boot//bzImage \
> >      -append "console=ttyS0,115200 xen_no_vector_callback"
> > 
> > ...
> > 
> > [    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > [    0.578149] The affinity mask was 0-3
> > [    0.579081] The affinity mask is 0-3 and the handler is on 2
> > [    0.580288] The affinity mask is 0 and the handler is on 2
> 
> What happens is that once the interrupt is requested, the affinity
> setting is deferred to the first interrupt. See the marvelous dance in
> arch/x86/kernel/apic/msi.c::msi_set_affinity().
> 
> If you do the setting before request_irq() then the startup will assign
> it to the target mask right away.
> 
> Btw, you are using irq_get_affinity_mask(), which gives you the desired
> target mask. irq_get_effective_affinity_mask() gives you the real one.
> 
> Can you verify that the thing moves over after the first interrupt or is
> that too late already?

It doesn't seem to move. The hack to just return IRQ_NONE if invoked on
CPU != 0 was intended to do just that. It's a level-triggered interrupt
so when the handler does nothing on the "wrong" CPU, it ought to get
invoked again on the *correct* CPU and actually work that time.

But no, as the above logs show, it gets invoked twice *both* on CPU2.

If I do the setting before request_irq() then it should assign it right
away (unless that IRQ was already in use? It's theoretically a shared
PCI INTx line). But even then, that would mean I'm messing with
affinity on an IRQ I haven't even requested yet and don't own?

Thomas Gleixner March 4, 2023, 12:28 a.m. UTC | #3

David!

On Fri, Mar 03 2023 at 16:54, David Woodhouse wrote:
> On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote:
>> > 
>> > [    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
>> > [    0.578149] The affinity mask was 0-3
>> > [    0.579081] The affinity mask is 0-3 and the handler is on 2
>> > [    0.580288] The affinity mask is 0 and the handler is on 2
>> 
>> What happens is that once the interrupt is requested, the affinity
>> setting is deferred to the first interrupt. See the marvelous dance in
>> arch/x86/kernel/apic/msi.c::msi_set_affinity().
>> 
>> If you do the setting before request_irq() then the startup will assign
>> it to the target mask right away.
>> 
>> Btw, you are using irq_get_affinity_mask(), which gives you the desired
>> target mask. irq_get_effective_affinity_mask() gives you the real one.
>> 
>> Can you verify that the thing moves over after the first interrupt or is
>> that too late already?
>
> It doesn't seem to move. The hack to just return IRQ_NONE if invoked on
> CPU != 0 was intended to do just that. It's a level-triggered interrupt
> so when the handler does nothing on the "wrong" CPU, it ought to get
> invoked again on the *correct* CPU and actually work that time.

So much for the theory. This is virt after all so it does not
necessarily behave like real hardware.

> But no, as the above logs show, it gets invoked twice *both* on CPU2.

Duh. I missed that. Can you instrument whether this ends up in in the
actual irq affinity setter function of the underlying irq chip at all?

> If I do the setting before request_irq() then it should assign it right
> away (unless that IRQ was already in use?

Correct.

> It's theoretically a shared PCI INTx line). But even then, that would
> mean I'm messing with affinity on an IRQ I haven't even requested yet
> and don't own?

Well, that's not any different from userspace changing the affinity of
an interrupt.

Thanks,

        tglx

David Woodhouse March 4, 2023, 9:57 a.m. UTC | #4

On Sat, 2023-03-04 at 01:28 +0100, Thomas Gleixner wrote:
> David!
> 
> On Fri, Mar 03 2023 at 16:54, David Woodhouse wrote:
> > On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote:
> > > > 
> > > > [    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > > > [    0.578149] The affinity mask was 0-3
> > > > [    0.579081] The affinity mask is 0-3 and the handler is on 2
> > > > [    0.580288] The affinity mask is 0 and the handler is on 2
> > > 
> > > What happens is that once the interrupt is requested, the affinity
> > > setting is deferred to the first interrupt. See the marvelous dance in
> > > arch/x86/kernel/apic/msi.c::msi_set_affinity().
> > > 
> > > If you do the setting before request_irq() then the startup will assign
> > > it to the target mask right away.
> > > 
> > > Btw, you are using irq_get_affinity_mask(), which gives you the desired
> > > target mask. irq_get_effective_affinity_mask() gives you the real one.
> > > 
> > > Can you verify that the thing moves over after the first interrupt or is
> > > that too late already?
> > 
> > It doesn't seem to move. The hack to just return IRQ_NONE if invoked on
> > CPU != 0 was intended to do just that. It's a level-triggered interrupt
> > so when the handler does nothing on the "wrong" CPU, it ought to get
> > invoked again on the *correct* CPU and actually work that time.
> 
> So much for the theory. This is virt after all so it does not
> necessarily behave like real hardware.

I think you're right. This looks like a QEMU bug with the "split
irqchip" I/OAPIC.

For reasons I'm unclear about, and which lack a comment in the code,
QEMU still injects I/OAPIC events into the kernel with kvm_set_irq().
(I think it's do to with caching, because QEMU doesn't cache interrupt-
remapping translations anywhere *except* in the KVM IRQ routing table,
so if it just synthesised an MSI message every time it'd have to
retranslate it every time?)

Tracing the behaviour here shows:

 • First interrupt happens on CPU2.
 • Linux updates the I/OAPIC RTE to point to CPU0, but QEMU doesn't
   update the KVM IRQ routing table yet.
 * QEMU retriggers the (still-high, level triggered) IRQ.
 • QEMU calls kvm_set_irq(11), delivering it to CPU2 again.
 • QEMU *finally* calls ioapic_update_kvm_routes().
 • Linux sees the interrupt on CPU2 again.

  $ qemu-system-x86_64 -display none -serial mon:stdio \
     -accel kvm,xen-version=0x4000a,kernel-irqchip=split \
     -kernel ~/git/linux/arch/x86/boot//bzImage \
     -append "console=ttyS0,115200 xen_no_vector_callback" \
     -smp 4 --trace ioapic\* --trace xenstore\*

...

xenstore_read tx 0 path control/platform-feature-xs_reset_watches
ioapic_set_irq vector: 11 level: 1
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
[    0.523627] The affinity mask was 0-3 and the handler is on 2
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee02000 8021
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
xenstore_reset_watches 
ioapic_set_irq vector: 11 level: 1
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x1c021
[    0.524569] ioapic_ack_level IRQ 11 moveit = 1
ioapic_eoi_broadcast EOI broadcast for vector 33
ioapic_clear_remote_irr clear remote irr for pin 11 vector 33
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x18021
[    0.525235] ioapic_finish_move IRQ 11 calls irq_move_masked_irq()
[    0.526147] irq_do_set_affinity for IRQ 11, 0
[    0.526732] ioapic_set_affinity for IRQ 11, 0
[    0.527330] ioapic_setup_msg_from_msi for IRQ11 target 0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x27
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x27 size 0x4 val 0x0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
[    0.527623] ioapic_set_affinity returns 0
[    0.527623] ioapic_finish_move IRQ 11 calls unmask_ioapic_irq()
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x8021
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee00000 8021
[    0.529571] The affinity mask was 0 and the handler is on 2
[    xenstore_watch path memory/target token FFFFFFFF92847D40
xenstore_watch_event path memory/target token FFFFFFFF92847D40
ioapic_set_irq vector: 11 level: 1
0.530486] ioapic_ack_level IRQ 11 moveit = 0

This is with Linux doing basically nothing when the handler is invoked
on the 'wrong' CPU, and just waiting for it to be right.

Commenting out the kvm_set_irq() calls in ioapic_service() and letting
QEMU synthesise an MSI every time works. Better still, so does this,
making it update the routing table *before* retriggering the IRQ when
the guest updates the RTE:

--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -405,6 +409,7 @@ ioapic_mem_write(void *opaque, hwaddr addr,
uint64_t val,
                 s->ioredtbl[index] |= ro_bits;
                 s->irq_eoi[index] = 0;
                 ioapic_fix_edge_remote_irr(&s->ioredtbl[index]);
+                ioapic_update_kvm_routes(s);
                 ioapic_service(s);
             }
         }
@@ -418,7 +423,6 @@ ioapic_mem_write(void *opaque, hwaddr addr,
uint64_t val,
         break;
     }

-    ioapic_update_kvm_routes(s);
 }

 static const MemoryRegionOps ioapic_io_ops = {

Now, I don't quite see why we don't get a *third* interrupt, since
Linux did nothing to clear the level of IRQ 11 and the last trace I see
from QEMU's ioapic_set_irq confirms it's still set. But I've exceeded
my screen time for the day, so I'll have to frown at that part some
more later. I wonder if the EOI is going missing because it's coming
from the wrong CPU? Note no 'EOI broadcast' after the last line in the
log I showed above; it isn't just that I trimmed it there.

I don't think we need to do anything in Linux; if the handler gets
invoked on the wrong CPU it'll basically find no events pending for
that CPU and return having done nothing... and *hopefully* should be
re-invoked on the correct CPU shortly thereafter.

David Woodhouse March 4, 2023, 10:15 a.m. UTC | #5

On Sat, 2023-03-04 at 09:57 +0000, David Woodhouse wrote:
> I wonder if the EOI is going missing because it's coming
> from the wrong CPU? Note no 'EOI broadcast' after the last line in the
> log I showed above; it isn't just that I trimmed it there.

I'm running on a host kernel without commit fceb3a36c29a so that's
probably it.

https://git.kernel.org/torvalds/c/fceb3a36c29a

IRQ affinity not working on Xen pci-platform device

Commit Message

Comments

Patch