Message ID | 07866eaf6354dd43d87cffb6eebf101716845b66.camel@infradead.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | IRQ affinity not working on Xen pci-platform device | expand |
David! On Fri, Mar 03 2023 at 15:16, David Woodhouse wrote: > I added the 'xen_no_vector_callback' kernel parameter a while back > (commit b36b0fe96af) to ensure we could test that more for Linux > guests. > > Most of my testing at the time was done with just two CPUs, and I > happened to just test it with four. It fails, because the IRQ isn't > actually affine to CPU0. > > I tried making it work anyway (in line with the comment in platform- > pci.c which says that it shouldn't matter if it *runs* on CPU0 as long > as it processes events *for* CPU0). That didn't seem to work. > > If I put the irq_set_affinity() call *before* the request_irq() that > does actually work. But it's setting affinity on an IRQ it doesn't even > own yet. The core allows it for raisins. See below... :) > Test hacks below; this is testable with today's QEMU master (yay!) and: > > qemu-system-x86_64 -display none -serial mon:stdio -smp 4 \ > -accel kvm,xen-version=0x4000a,kernel-irqchip=split \ > -kernel ~/git/linux/arch/x86/boot//bzImage \ > -append "console=ttyS0,115200 xen_no_vector_callback" > > ... > > [ 0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11 > [ 0.578149] The affinity mask was 0-3 > [ 0.579081] The affinity mask is 0-3 and the handler is on 2 > [ 0.580288] The affinity mask is 0 and the handler is on 2 What happens is that once the interrupt is requested, the affinity setting is deferred to the first interrupt. See the marvelous dance in arch/x86/kernel/apic/msi.c::msi_set_affinity(). If you do the setting before request_irq() then the startup will assign it to the target mask right away. Btw, you are using irq_get_affinity_mask(), which gives you the desired target mask. irq_get_effective_affinity_mask() gives you the real one. Can you verify that the thing moves over after the first interrupt or is that too late already? Thanks, tglx
On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote: > David! > > On Fri, Mar 03 2023 at 15:16, David Woodhouse wrote: > > I added the 'xen_no_vector_callback' kernel parameter a while back > > (commit b36b0fe96af) to ensure we could test that more for Linux > > guests. > > > > Most of my testing at the time was done with just two CPUs, and I > > happened to just test it with four. It fails, because the IRQ isn't > > actually affine to CPU0. > > > > I tried making it work anyway (in line with the comment in platform- > > pci.c which says that it shouldn't matter if it *runs* on CPU0 as long > > as it processes events *for* CPU0). That didn't seem to work. > > > > If I put the irq_set_affinity() call *before* the request_irq() that > > does actually work. But it's setting affinity on an IRQ it doesn't even > > own yet. > > The core allows it for raisins. See below... :) > > > Test hacks below; this is testable with today's QEMU master (yay!) and: > > > > qemu-system-x86_64 -display none -serial mon:stdio -smp 4 \ > > -accel kvm,xen-version=0x4000a,kernel-irqchip=split \ > > -kernel ~/git/linux/arch/x86/boot//bzImage \ > > -append "console=ttyS0,115200 xen_no_vector_callback" > > > > ... > > > > [ 0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11 > > [ 0.578149] The affinity mask was 0-3 > > [ 0.579081] The affinity mask is 0-3 and the handler is on 2 > > [ 0.580288] The affinity mask is 0 and the handler is on 2 > > What happens is that once the interrupt is requested, the affinity > setting is deferred to the first interrupt. See the marvelous dance in > arch/x86/kernel/apic/msi.c::msi_set_affinity(). > > If you do the setting before request_irq() then the startup will assign > it to the target mask right away. > > Btw, you are using irq_get_affinity_mask(), which gives you the desired > target mask. irq_get_effective_affinity_mask() gives you the real one. > > Can you verify that the thing moves over after the first interrupt or is > that too late already? It doesn't seem to move. The hack to just return IRQ_NONE if invoked on CPU != 0 was intended to do just that. It's a level-triggered interrupt so when the handler does nothing on the "wrong" CPU, it ought to get invoked again on the *correct* CPU and actually work that time. But no, as the above logs show, it gets invoked twice *both* on CPU2. If I do the setting before request_irq() then it should assign it right away (unless that IRQ was already in use? It's theoretically a shared PCI INTx line). But even then, that would mean I'm messing with affinity on an IRQ I haven't even requested yet and don't own?
David! On Fri, Mar 03 2023 at 16:54, David Woodhouse wrote: > On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote: >> > >> > [ 0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11 >> > [ 0.578149] The affinity mask was 0-3 >> > [ 0.579081] The affinity mask is 0-3 and the handler is on 2 >> > [ 0.580288] The affinity mask is 0 and the handler is on 2 >> >> What happens is that once the interrupt is requested, the affinity >> setting is deferred to the first interrupt. See the marvelous dance in >> arch/x86/kernel/apic/msi.c::msi_set_affinity(). >> >> If you do the setting before request_irq() then the startup will assign >> it to the target mask right away. >> >> Btw, you are using irq_get_affinity_mask(), which gives you the desired >> target mask. irq_get_effective_affinity_mask() gives you the real one. >> >> Can you verify that the thing moves over after the first interrupt or is >> that too late already? > > It doesn't seem to move. The hack to just return IRQ_NONE if invoked on > CPU != 0 was intended to do just that. It's a level-triggered interrupt > so when the handler does nothing on the "wrong" CPU, it ought to get > invoked again on the *correct* CPU and actually work that time. So much for the theory. This is virt after all so it does not necessarily behave like real hardware. > But no, as the above logs show, it gets invoked twice *both* on CPU2. Duh. I missed that. Can you instrument whether this ends up in in the actual irq affinity setter function of the underlying irq chip at all? > If I do the setting before request_irq() then it should assign it right > away (unless that IRQ was already in use? Correct. > It's theoretically a shared PCI INTx line). But even then, that would > mean I'm messing with affinity on an IRQ I haven't even requested yet > and don't own? Well, that's not any different from userspace changing the affinity of an interrupt. Thanks, tglx
On Sat, 2023-03-04 at 01:28 +0100, Thomas Gleixner wrote: > David! > > On Fri, Mar 03 2023 at 16:54, David Woodhouse wrote: > > On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote: > > > > > > > > [ 0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11 > > > > [ 0.578149] The affinity mask was 0-3 > > > > [ 0.579081] The affinity mask is 0-3 and the handler is on 2 > > > > [ 0.580288] The affinity mask is 0 and the handler is on 2 > > > > > > What happens is that once the interrupt is requested, the affinity > > > setting is deferred to the first interrupt. See the marvelous dance in > > > arch/x86/kernel/apic/msi.c::msi_set_affinity(). > > > > > > If you do the setting before request_irq() then the startup will assign > > > it to the target mask right away. > > > > > > Btw, you are using irq_get_affinity_mask(), which gives you the desired > > > target mask. irq_get_effective_affinity_mask() gives you the real one. > > > > > > Can you verify that the thing moves over after the first interrupt or is > > > that too late already? > > > > It doesn't seem to move. The hack to just return IRQ_NONE if invoked on > > CPU != 0 was intended to do just that. It's a level-triggered interrupt > > so when the handler does nothing on the "wrong" CPU, it ought to get > > invoked again on the *correct* CPU and actually work that time. > > So much for the theory. This is virt after all so it does not > necessarily behave like real hardware. I think you're right. This looks like a QEMU bug with the "split irqchip" I/OAPIC. For reasons I'm unclear about, and which lack a comment in the code, QEMU still injects I/OAPIC events into the kernel with kvm_set_irq(). (I think it's do to with caching, because QEMU doesn't cache interrupt- remapping translations anywhere *except* in the KVM IRQ routing table, so if it just synthesised an MSI message every time it'd have to retranslate it every time?) Tracing the behaviour here shows: • First interrupt happens on CPU2. • Linux updates the I/OAPIC RTE to point to CPU0, but QEMU doesn't update the KVM IRQ routing table yet. * QEMU retriggers the (still-high, level triggered) IRQ. • QEMU calls kvm_set_irq(11), delivering it to CPU2 again. • QEMU *finally* calls ioapic_update_kvm_routes(). • Linux sees the interrupt on CPU2 again. $ qemu-system-x86_64 -display none -serial mon:stdio \ -accel kvm,xen-version=0x4000a,kernel-irqchip=split \ -kernel ~/git/linux/arch/x86/boot//bzImage \ -append "console=ttyS0,115200 xen_no_vector_callback" \ -smp 4 --trace ioapic\* --trace xenstore\* ... xenstore_read tx 0 path control/platform-feature-xs_reset_watches ioapic_set_irq vector: 11 level: 1 ioapic_set_remote_irr set remote irr for pin 11 ioapic_service: trigger KVM IRQ 11 [ 0.523627] The affinity mask was 0-3 and the handler is on 2 ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26 ioapic_update_kvm_routes: update KVM route for IRQ 11: fee02000 8021 ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021 xenstore_reset_watches ioapic_set_irq vector: 11 level: 1 ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x1c021 [ 0.524569] ioapic_ack_level IRQ 11 moveit = 1 ioapic_eoi_broadcast EOI broadcast for vector 33 ioapic_clear_remote_irr clear remote irr for pin 11 vector 33 ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26 ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x18021 [ 0.525235] ioapic_finish_move IRQ 11 calls irq_move_masked_irq() [ 0.526147] irq_do_set_affinity for IRQ 11, 0 [ 0.526732] ioapic_set_affinity for IRQ 11, 0 [ 0.527330] ioapic_setup_msg_from_msi for IRQ11 target 0 ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x27 ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x27 size 0x4 val 0x0 ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26 ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021 [ 0.527623] ioapic_set_affinity returns 0 [ 0.527623] ioapic_finish_move IRQ 11 calls unmask_ioapic_irq() ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26 ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x8021 ioapic_set_remote_irr set remote irr for pin 11 ioapic_service: trigger KVM IRQ 11 ioapic_update_kvm_routes: update KVM route for IRQ 11: fee00000 8021 [ 0.529571] The affinity mask was 0 and the handler is on 2 [ xenstore_watch path memory/target token FFFFFFFF92847D40 xenstore_watch_event path memory/target token FFFFFFFF92847D40 ioapic_set_irq vector: 11 level: 1 0.530486] ioapic_ack_level IRQ 11 moveit = 0 This is with Linux doing basically nothing when the handler is invoked on the 'wrong' CPU, and just waiting for it to be right. Commenting out the kvm_set_irq() calls in ioapic_service() and letting QEMU synthesise an MSI every time works. Better still, so does this, making it update the routing table *before* retriggering the IRQ when the guest updates the RTE: --- a/hw/intc/ioapic.c +++ b/hw/intc/ioapic.c @@ -405,6 +409,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val, s->ioredtbl[index] |= ro_bits; s->irq_eoi[index] = 0; ioapic_fix_edge_remote_irr(&s->ioredtbl[index]); + ioapic_update_kvm_routes(s); ioapic_service(s); } } @@ -418,7 +423,6 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val, break; } - ioapic_update_kvm_routes(s); } static const MemoryRegionOps ioapic_io_ops = { Now, I don't quite see why we don't get a *third* interrupt, since Linux did nothing to clear the level of IRQ 11 and the last trace I see from QEMU's ioapic_set_irq confirms it's still set. But I've exceeded my screen time for the day, so I'll have to frown at that part some more later. I wonder if the EOI is going missing because it's coming from the wrong CPU? Note no 'EOI broadcast' after the last line in the log I showed above; it isn't just that I trimmed it there. I don't think we need to do anything in Linux; if the handler gets invoked on the wrong CPU it'll basically find no events pending for that CPU and return having done nothing... and *hopefully* should be re-invoked on the correct CPU shortly thereafter.
On Sat, 2023-03-04 at 09:57 +0000, David Woodhouse wrote: > I wonder if the EOI is going missing because it's coming > from the wrong CPU? Note no 'EOI broadcast' after the last line in the > log I showed above; it isn't just that I trimmed it there. I'm running on a host kernel without commit fceb3a36c29a so that's probably it. https://git.kernel.org/torvalds/c/fceb3a36c29a
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c index c7715f8bd452..e3d159f1eb86 100644 --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -1712,11 +1712,12 @@ void handle_irq_for_port(evtchn_port_t port, struct evtchn_loop_ctrl *ctrl) static int __xen_evtchn_do_upcall(void) { - struct vcpu_info *vcpu_info = __this_cpu_read(xen_vcpu); + struct vcpu_info *vcpu_info = per_cpu(xen_vcpu, 0); int ret = vcpu_info->evtchn_upcall_pending ? IRQ_HANDLED : IRQ_NONE; - int cpu = smp_processor_id(); + int cpu = 0;//smp_processor_id(); struct evtchn_loop_ctrl ctrl = { 0 }; + WARN_ON_ONCE(smp_processor_id() != 0); read_lock(&evtchn_rwlock); do { diff --git a/drivers/xen/platform-pci.c b/drivers/xen/platform-pci.c index fcc819131572..647991211633 100644 --- a/drivers/xen/platform-pci.c +++ b/drivers/xen/platform-pci.c @@ -64,6 +64,16 @@ static uint64_t get_callback_via(struct pci_dev *pdev) static irqreturn_t do_hvm_evtchn_intr(int irq, void *dev_id) { + struct pci_dev *pdev = dev_id; + + if (unlikely(smp_processor_id())) { + const struct cpumask *mask = irq_get_affinity_mask(pdev->irq); + if (mask) + printk("The affinity mask is %*pbl and the handler is on %d\n", + cpumask_pr_args(mask), smp_processor_id()); + return IRQ_NONE; + } + return xen_hvm_evtchn_do_upcall(); } @@ -132,6 +142,12 @@ static int platform_pci_probe(struct pci_dev *pdev, dev_warn(&pdev->dev, "request_irq failed err=%d\n", ret); goto out; } + + const struct cpumask *mask = irq_get_affinity_mask(pdev->irq); + if (mask) + printk("The affinity mask was %*pbl\n", + cpumask_pr_args(mask)); + /* * It doesn't strictly *have* to run on CPU0 but it sure * as hell better process the event channel ports delivered