
IRQ affinity not working on Xen pci-platform device

Message ID 07866eaf6354dd43d87cffb6eebf101716845b66.camel@infradead.org (mailing list archive)
State New, archived
Series IRQ affinity not working on Xen pci-platform device

Commit Message

David Woodhouse March 3, 2023, 3:16 p.m. UTC
I added the 'xen_no_vector_callback' kernel parameter a while back
(commit b36b0fe96af) to ensure we could test that more for Linux
guests.

Most of my testing at the time was done with just two CPUs, and I
happened to just test it with four. It fails, because the IRQ isn't
actually affine to CPU0.

I tried making it work anyway (in line with the comment in platform-
pci.c which says that it shouldn't matter if it *runs* on CPU0 as long
as it processes events *for* CPU0). That didn't seem to work.

If I put the irq_set_affinity() call *before* the request_irq() that
does actually work. But it's setting affinity on an IRQ it doesn't even
own yet.

Test hacks below; this is testable with today's QEMU master (yay!) and:

  qemu-system-x86_64 -display none -serial mon:stdio -smp 4 \
     -accel kvm,xen-version=0x4000a,kernel-irqchip=split \
     -kernel ~/git/linux/arch/x86/boot//bzImage \
     -append "console=ttyS0,115200 xen_no_vector_callback"

...

[    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
[    0.578149] The affinity mask was 0-3
[    0.579081] The affinity mask is 0-3 and the handler is on 2
[    0.580288] The affinity mask is 0 and the handler is on 2

Comments

Thomas Gleixner March 3, 2023, 4:51 p.m. UTC | #1
David!

On Fri, Mar 03 2023 at 15:16, David Woodhouse wrote:
> I added the 'xen_no_vector_callback' kernel parameter a while back
> (commit b36b0fe96af) to ensure we could test that more for Linux
> guests.
>
> Most of my testing at the time was done with just two CPUs, and I
> happened to just test it with four. It fails, because the IRQ isn't
> actually affine to CPU0.
>
> I tried making it work anyway (in line with the comment in platform-
> pci.c which says that it shouldn't matter if it *runs* on CPU0 as long
> as it processes events *for* CPU0). That didn't seem to work.
>
> If I put the irq_set_affinity() call *before* the request_irq() that
> does actually work. But it's setting affinity on an IRQ it doesn't even
> own yet.

The core allows it for raisins. See below... :)

> Test hacks below; this is testable with today's QEMU master (yay!) and:
>
>   qemu-system-x86_64 -display none -serial mon:stdio -smp 4 \
>      -accel kvm,xen-version=0x4000a,kernel-irqchip=split \
>      -kernel ~/git/linux/arch/x86/boot//bzImage \
>      -append "console=ttyS0,115200 xen_no_vector_callback"
>
> ...
>
> [    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
> [    0.578149] The affinity mask was 0-3
> [    0.579081] The affinity mask is 0-3 and the handler is on 2
> [    0.580288] The affinity mask is 0 and the handler is on 2

What happens is that once the interrupt is requested, the affinity
setting is deferred to the first interrupt. See the marvelous dance in
arch/x86/kernel/apic/msi.c::msi_set_affinity().

If you do the setting before request_irq() then the startup will assign
it to the target mask right away.

Btw, you are using irq_get_affinity_mask(), which gives you the desired
target mask. irq_get_effective_affinity_mask() gives you the real one.

Can you verify that the thing moves over after the first interrupt or is
that too late already?

Thanks,

        tglx
David Woodhouse March 3, 2023, 4:54 p.m. UTC | #2
On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote:
> David!
> 
> On Fri, Mar 03 2023 at 15:16, David Woodhouse wrote:
> > I added the 'xen_no_vector_callback' kernel parameter a while back
> > (commit b36b0fe96af) to ensure we could test that more for Linux
> > guests.
> > 
> > Most of my testing at the time was done with just two CPUs, and I
> > happened to just test it with four. It fails, because the IRQ isn't
> > actually affine to CPU0.
> > 
> > I tried making it work anyway (in line with the comment in platform-
> > pci.c which says that it shouldn't matter if it *runs* on CPU0 as long
> > as it processes events *for* CPU0). That didn't seem to work.
> > 
> > If I put the irq_set_affinity() call *before* the request_irq() that
> > does actually work. But it's setting affinity on an IRQ it doesn't even
> > own yet.
> 
> The core allows it for raisins. See below... :)
> 
> > Test hacks below; this is testable with today's QEMU master (yay!) and:
> > 
> >   qemu-system-x86_64 -display none -serial mon:stdio -smp 4 \
> >      -accel kvm,xen-version=0x4000a,kernel-irqchip=split \
> >      -kernel ~/git/linux/arch/x86/boot//bzImage \
> >      -append "console=ttyS0,115200 xen_no_vector_callback"
> > 
> > ...
> > 
> > [    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > [    0.578149] The affinity mask was 0-3
> > [    0.579081] The affinity mask is 0-3 and the handler is on 2
> > [    0.580288] The affinity mask is 0 and the handler is on 2
> 
> What happens is that once the interrupt is requested, the affinity
> setting is deferred to the first interrupt. See the marvelous dance in
> arch/x86/kernel/apic/msi.c::msi_set_affinity().
> 
> If you do the setting before request_irq() then the startup will assign
> it to the target mask right away.
> 
> Btw, you are using irq_get_affinity_mask(), which gives you the desired
> target mask. irq_get_effective_affinity_mask() gives you the real one.
> 
> Can you verify that the thing moves over after the first interrupt or is
> that too late already?

It doesn't seem to move. The hack to just return IRQ_NONE if invoked on
CPU != 0 was intended to do just that. It's a level-triggered interrupt
so when the handler does nothing on the "wrong" CPU, it ought to get
invoked again on the *correct* CPU and actually work that time.

But no, as the above logs show, it gets invoked twice *both* on CPU2.

If I do the setting before request_irq() then it should assign it right
away (unless that IRQ was already in use? It's theoretically a shared
PCI INTx line). But even then, that would mean I'm messing with
affinity on an IRQ I haven't even requested yet and don't own?
Thomas Gleixner March 4, 2023, 12:28 a.m. UTC | #3
David!

On Fri, Mar 03 2023 at 16:54, David Woodhouse wrote:
> On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote:
>> > 
>> > [    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
>> > [    0.578149] The affinity mask was 0-3
>> > [    0.579081] The affinity mask is 0-3 and the handler is on 2
>> > [    0.580288] The affinity mask is 0 and the handler is on 2
>> 
>> What happens is that once the interrupt is requested, the affinity
>> setting is deferred to the first interrupt. See the marvelous dance in
>> arch/x86/kernel/apic/msi.c::msi_set_affinity().
>> 
>> If you do the setting before request_irq() then the startup will assign
>> it to the target mask right away.
>> 
>> Btw, you are using irq_get_affinity_mask(), which gives you the desired
>> target mask. irq_get_effective_affinity_mask() gives you the real one.
>> 
>> Can you verify that the thing moves over after the first interrupt or is
>> that too late already?
>
> It doesn't seem to move. The hack to just return IRQ_NONE if invoked on
> CPU != 0 was intended to do just that. It's a level-triggered interrupt
> so when the handler does nothing on the "wrong" CPU, it ought to get
> invoked again on the *correct* CPU and actually work that time.

So much for the theory. This is virt after all so it does not
necessarily behave like real hardware.

> But no, as the above logs show, it gets invoked twice *both* on CPU2.

Duh. I missed that. Can you instrument whether this ends up in the
actual irq affinity setter function of the underlying irq chip at all?

> If I do the setting before request_irq() then it should assign it right
> away (unless that IRQ was already in use?

Correct.

> It's theoretically a shared PCI INTx line). But even then, that would
> mean I'm messing with affinity on an IRQ I haven't even requested yet
> and don't own?

Well, that's not any different from userspace changing the affinity of
an interrupt.

Thanks,

        tglx
David Woodhouse March 4, 2023, 9:57 a.m. UTC | #4
On Sat, 2023-03-04 at 01:28 +0100, Thomas Gleixner wrote:
> David!
> 
> On Fri, Mar 03 2023 at 16:54, David Woodhouse wrote:
> > On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote:
> > > > 
> > > > [    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > > > [    0.578149] The affinity mask was 0-3
> > > > [    0.579081] The affinity mask is 0-3 and the handler is on 2
> > > > [    0.580288] The affinity mask is 0 and the handler is on 2
> > > 
> > > What happens is that once the interrupt is requested, the affinity
> > > setting is deferred to the first interrupt. See the marvelous dance in
> > > arch/x86/kernel/apic/msi.c::msi_set_affinity().
> > > 
> > > If you do the setting before request_irq() then the startup will assign
> > > it to the target mask right away.
> > > 
> > > Btw, you are using irq_get_affinity_mask(), which gives you the desired
> > > target mask. irq_get_effective_affinity_mask() gives you the real one.
> > > 
> > > Can you verify that the thing moves over after the first interrupt or is
> > > that too late already?
> > 
> > It doesn't seem to move. The hack to just return IRQ_NONE if invoked on
> > CPU != 0 was intended to do just that. It's a level-triggered interrupt
> > so when the handler does nothing on the "wrong" CPU, it ought to get
> > invoked again on the *correct* CPU and actually work that time.
> 
> So much for the theory. This is virt after all so it does not
> necessarily behave like real hardware.

I think you're right. This looks like a QEMU bug with the "split
irqchip" I/OAPIC.

For reasons I'm unclear about, and which lack a comment in the code,
QEMU still injects I/OAPIC events into the kernel with kvm_set_irq().
(I think it's to do with caching, because QEMU doesn't cache interrupt-
remapping translations anywhere *except* in the KVM IRQ routing table,
so if it just synthesised an MSI message every time it'd have to
retranslate it every time?)

Tracing the behaviour here shows:

 • First interrupt happens on CPU2.
 • Linux updates the I/OAPIC RTE to point to CPU0, but QEMU doesn't
   update the KVM IRQ routing table yet.
 • QEMU retriggers the (still-high, level triggered) IRQ.
 • QEMU calls kvm_set_irq(11), delivering it to CPU2 again.
 • QEMU *finally* calls ioapic_update_kvm_routes().
 • Linux sees the interrupt on CPU2 again.

  $ qemu-system-x86_64 -display none -serial mon:stdio \
     -accel kvm,xen-version=0x4000a,kernel-irqchip=split \
     -kernel ~/git/linux/arch/x86/boot//bzImage \
     -append "console=ttyS0,115200 xen_no_vector_callback" \
     -smp 4 --trace ioapic\* --trace xenstore\*


...

xenstore_read tx 0 path control/platform-feature-xs_reset_watches
ioapic_set_irq vector: 11 level: 1
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
[    0.523627] The affinity mask was 0-3 and the handler is on 2
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee02000 8021
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
xenstore_reset_watches 
ioapic_set_irq vector: 11 level: 1
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x1c021
[    0.524569] ioapic_ack_level IRQ 11 moveit = 1
ioapic_eoi_broadcast EOI broadcast for vector 33
ioapic_clear_remote_irr clear remote irr for pin 11 vector 33
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x18021
[    0.525235] ioapic_finish_move IRQ 11 calls irq_move_masked_irq()
[    0.526147] irq_do_set_affinity for IRQ 11, 0
[    0.526732] ioapic_set_affinity for IRQ 11, 0
[    0.527330] ioapic_setup_msg_from_msi for IRQ11 target 0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x27
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x27 size 0x4 val 0x0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
[    0.527623] ioapic_set_affinity returns 0
[    0.527623] ioapic_finish_move IRQ 11 calls unmask_ioapic_irq()
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x8021
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee00000 8021
[    0.529571] The affinity mask was 0 and the handler is on 2
xenstore_watch path memory/target token FFFFFFFF92847D40
xenstore_watch_event path memory/target token FFFFFFFF92847D40
ioapic_set_irq vector: 11 level: 1
[    0.530486] ioapic_ack_level IRQ 11 moveit = 0


This is with Linux doing basically nothing when the handler is invoked
on the 'wrong' CPU, and just waiting for it to be right.

Commenting out the kvm_set_irq() calls in ioapic_service() and letting
QEMU synthesise an MSI every time works. Better still, so does this,
making it update the routing table *before* retriggering the IRQ when
the guest updates the RTE:

--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -405,6 +409,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
                 s->ioredtbl[index] |= ro_bits;
                 s->irq_eoi[index] = 0;
                 ioapic_fix_edge_remote_irr(&s->ioredtbl[index]);
+                ioapic_update_kvm_routes(s);
                 ioapic_service(s);
             }
         }
@@ -418,7 +423,6 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
         break;
     }
 
-    ioapic_update_kvm_routes(s);
 }
 
 static const MemoryRegionOps ioapic_io_ops = {



Now, I don't quite see why we don't get a *third* interrupt, since
Linux did nothing to clear the level of IRQ 11 and the last trace I see
from QEMU's ioapic_set_irq confirms it's still set. But I've exceeded
my screen time for the day, so I'll have to frown at that part some
more later. I wonder if the EOI is going missing because it's coming
from the wrong CPU? Note no 'EOI broadcast' after the last line in the
log I showed above; it isn't just that I trimmed it there.

I don't think we need to do anything in Linux; if the handler gets
invoked on the wrong CPU it'll basically find no events pending for
that CPU and return having done nothing... and *hopefully* should be
re-invoked on the correct CPU shortly thereafter.
David Woodhouse March 4, 2023, 10:15 a.m. UTC | #5
On Sat, 2023-03-04 at 09:57 +0000, David Woodhouse wrote:
> I wonder if the EOI is going missing because it's coming
> from the wrong CPU? Note no 'EOI broadcast' after the last line in the
> log I showed above; it isn't just that I trimmed it there.

I'm running on a host kernel without commit fceb3a36c29a so that's
probably it.

https://git.kernel.org/torvalds/c/fceb3a36c29a

Patch

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c7715f8bd452..e3d159f1eb86 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1712,11 +1712,12 @@  void handle_irq_for_port(evtchn_port_t port, struct evtchn_loop_ctrl *ctrl)
 
 static int __xen_evtchn_do_upcall(void)
 {
-	struct vcpu_info *vcpu_info = __this_cpu_read(xen_vcpu);
+	struct vcpu_info *vcpu_info = per_cpu(xen_vcpu, 0);
 	int ret = vcpu_info->evtchn_upcall_pending ? IRQ_HANDLED : IRQ_NONE;
-	int cpu = smp_processor_id();
+	int cpu = 0;//smp_processor_id();
 	struct evtchn_loop_ctrl ctrl = { 0 };
 
+	WARN_ON_ONCE(smp_processor_id() != 0);
 	read_lock(&evtchn_rwlock);
 
 	do {
diff --git a/drivers/xen/platform-pci.c b/drivers/xen/platform-pci.c
index fcc819131572..647991211633 100644
--- a/drivers/xen/platform-pci.c
+++ b/drivers/xen/platform-pci.c
@@ -64,6 +64,16 @@  static uint64_t get_callback_via(struct pci_dev *pdev)
 
 static irqreturn_t do_hvm_evtchn_intr(int irq, void *dev_id)
 {
+	struct pci_dev *pdev = dev_id;
+
+	if (unlikely(smp_processor_id())) {
+		const struct cpumask *mask = irq_get_affinity_mask(pdev->irq);
+		if (mask)
+			printk("The affinity mask is %*pbl and the handler is on %d\n",
+			       cpumask_pr_args(mask), smp_processor_id());
+		return IRQ_NONE;
+	}
+
 	return xen_hvm_evtchn_do_upcall();
 }
 
@@ -132,6 +142,12 @@  static int platform_pci_probe(struct pci_dev *pdev,
 			dev_warn(&pdev->dev, "request_irq failed err=%d\n", ret);
 			goto out;
 		}
+
+		const struct cpumask *mask = irq_get_affinity_mask(pdev->irq);
+		if (mask)
+			printk("The affinity mask was %*pbl\n",
+			       cpumask_pr_args(mask));
+
 		/*
 		 * It doesn't strictly *have* to run on CPU0 but it sure
 		 * as hell better process the event channel ports delivered