Message ID | 20250206150615.52052-1-roger.pau@citrix.com (mailing list archive) |
---|---|
Headers | show |
Series | xen/x86: prevent local APIC errors at shutdown | expand |
Hello, This should have had a 'for-4.20?' tag in the subject name, as otherwise we will need to add an errata to the release notes to notice that reboot can sometimes fail on AMD boxes. Also adding Oleksii. Thanks, Roger. On Thu, Feb 06, 2025 at 04:06:10PM +0100, Roger Pau Monne wrote: > Hello, > > The following series aims to prevent local APIC errors from stalling the > shtudown process. On XenServer testing we have seen reports of AMD > boxes sporadically getting stuck in a spam of: > > APIC error on CPU0: 00(08), Receive accept error > > Messages during shutdown, as a result of device interrupts targeting > CPUs that are offline (and have the local APIC disabled). > > First patch strictly solves the issue of shutdown getting stuck, further > patches aim to quiesce interrupts from all devices (known by Xen) as an > attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also > make kexec more reliable. > > Thanks, Roger. > > Roger Pau Monne (5): > x86/shutdown: offline APs with interrupts disabled on all CPUs > x86/irq: drop fixup_irqs() parameters > x86/smp: perform disabling on interrupts ahead of AP shutdown > x86/pci: disable MSI(-X) on all devices at shutdown > x86/iommu: disable interrupts at shutdown > > xen/arch/x86/crash.c | 2 ++ > xen/arch/x86/include/asm/irq.h | 4 +-- > xen/arch/x86/include/asm/msi.h | 1 + > xen/arch/x86/irq.c | 30 ++++++++----------- > xen/arch/x86/msi.c | 18 +++++++++++ > xen/arch/x86/smp.c | 33 +++++++++++++++------ > xen/arch/x86/smpboot.c | 2 +- > xen/drivers/passthrough/amd/iommu.h | 1 + > xen/drivers/passthrough/amd/iommu_init.c | 17 +++++++++++ > xen/drivers/passthrough/amd/pci_amd_iommu.c | 1 + > xen/drivers/passthrough/iommu.c | 6 ++++ > xen/drivers/passthrough/pci.c | 33 +++++++++++++++++++++ > xen/drivers/passthrough/vtd/iommu.c | 19 ++++++++++++ > xen/include/xen/iommu.h | 3 ++ > xen/include/xen/pci.h | 4 +++ > 15 files changed, 145 insertions(+), 29 deletions(-) > > -- > 2.46.0 >
Hello Roger, On 2/10/25 11:02 AM, Roger Pau Monné wrote: > Hello, > > This should have had a 'for-4.20?' tag in the subject name, as > otherwise we will need to add an errata to the release notes to notice > that reboot can sometimes fail on AMD boxes. > > Also adding Oleksii. > > Thanks, Roger. > > On Thu, Feb 06, 2025 at 04:06:10PM +0100, Roger Pau Monne wrote: >> Hello, >> >> The following series aims to prevent local APIC errors from stalling the >> shtudown process. On XenServer testing we have seen reports of AMD >> boxes sporadically getting stuck in a spam of: How often this issue happen? >> >> APIC error on CPU0: 00(08), Receive accept error >> >> Messages during shutdown, as a result of device interrupts targeting >> CPUs that are offline (and have the local APIC disabled). >> >> First patch strictly solves the issue of shutdown getting stuck, further >> patches aim to quiesce interrupts from all devices (known by Xen) as an >> attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also >> make kexec more reliable. If the first patch solves does it make sense to consider, at least, it to be merged? ~ Oleksii >> >> Thanks, Roger. >> >> Roger Pau Monne (5): >> x86/shutdown: offline APs with interrupts disabled on all CPUs >> x86/irq: drop fixup_irqs() parameters >> x86/smp: perform disabling on interrupts ahead of AP shutdown >> x86/pci: disable MSI(-X) on all devices at shutdown >> x86/iommu: disable interrupts at shutdown >> >> xen/arch/x86/crash.c | 2 ++ >> xen/arch/x86/include/asm/irq.h | 4 +-- >> xen/arch/x86/include/asm/msi.h | 1 + >> xen/arch/x86/irq.c | 30 ++++++++----------- >> xen/arch/x86/msi.c | 18 +++++++++++ >> xen/arch/x86/smp.c | 33 +++++++++++++++------ >> xen/arch/x86/smpboot.c | 2 +- >> xen/drivers/passthrough/amd/iommu.h | 1 + >> xen/drivers/passthrough/amd/iommu_init.c | 17 +++++++++++ >> xen/drivers/passthrough/amd/pci_amd_iommu.c | 1 + >> xen/drivers/passthrough/iommu.c | 6 ++++ >> xen/drivers/passthrough/pci.c | 33 +++++++++++++++++++++ >> xen/drivers/passthrough/vtd/iommu.c | 19 ++++++++++++ >> xen/include/xen/iommu.h | 3 ++ >> xen/include/xen/pci.h | 4 +++ >> 15 files changed, 145 insertions(+), 29 deletions(-) >> >> -- >> 2.46.0 >>
On 06.02.2025 16:06, Roger Pau Monne wrote: > The following series aims to prevent local APIC errors from stalling the > shtudown process. On XenServer testing we have seen reports of AMD > boxes sporadically getting stuck in a spam of: > > APIC error on CPU0: 00(08), Receive accept error > > Messages during shutdown, as a result of device interrupts targeting > CPUs that are offline (and have the local APIC disabled). One more thought here: Have you/we perhaps discovered the reason why there was that 1ms delay at the end of fixup_irqs() that was badly commented, and that you removed in e2bb28d62158 ("x86/irq: forward pending interrupts to new destination in fixup_irqs()")? May be worth mentioning that by way of a Fixes: tag. Jan
On Mon, Feb 10, 2025 at 07:29:35PM +0100, Oleksii Kurochko wrote: > Hello Roger, > > On 2/10/25 11:02 AM, Roger Pau Monné wrote: > > Hello, > > > > This should have had a 'for-4.20?' tag in the subject name, as > > otherwise we will need to add an errata to the release notes to notice > > that reboot can sometimes fail on AMD boxes. > > > > Also adding Oleksii. > > > > Thanks, Roger. > > > > On Thu, Feb 06, 2025 at 04:06:10PM +0100, Roger Pau Monne wrote: > > > Hello, > > > > > > The following series aims to prevent local APIC errors from stalling the > > > shtudown process. On XenServer testing we have seen reports of AMD > > > boxes sporadically getting stuck in a spam of: > > How often this issue happen? Hard to tell, we have certainly hit it more than once on XenRT, but I don't have figures about its probability. I have at least 14 reports from XenRT from the last 6 months, but there's possibly a lot more that could have been classified as a different kind of issue. > > > > > > APIC error on CPU0: 00(08), Receive accept error > > > > > > Messages during shutdown, as a result of device interrupts targeting > > > CPUs that are offline (and have the local APIC disabled). > > > > > > First patch strictly solves the issue of shutdown getting stuck, further > > > patches aim to quiesce interrupts from all devices (known by Xen) as an > > > attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also > > > make kexec more reliable. > > If the first patch solves does it make sense to consider, at least, it to be merged? First one sure, the rest I think are also worth considering. They get rid of the resulting innocuous "APIC error on CPU0: 00(00)" message. Also remaining patches are likely to provide the kexec kernel with a better context, as they quiesce interrupts from devices. I will send a new version soon, hopefully we can discuss over that one which patches we want to pick. With my XenServer hat on I plan to backport the whole series into our patch queue. Thanks, Roger.
On Tue, Feb 11, 2025 at 07:39:12AM +0100, Jan Beulich wrote: > On 06.02.2025 16:06, Roger Pau Monne wrote: > > The following series aims to prevent local APIC errors from stalling the > > shtudown process. On XenServer testing we have seen reports of AMD > > boxes sporadically getting stuck in a spam of: > > > > APIC error on CPU0: 00(08), Receive accept error > > > > Messages during shutdown, as a result of device interrupts targeting > > CPUs that are offline (and have the local APIC disabled). > > One more thought here: Have you/we perhaps discovered the reason why there > was that 1ms delay at the end of fixup_irqs() that was badly commented, > and that you removed in e2bb28d62158 ("x86/irq: forward pending interrupts > to new destination in fixup_irqs()")? May be worth mentioning that by way > of a Fixes: tag. Hm, so you think the delay was added there as a way to ensure any pending interrupts would get drained (ie: serviced) on the old target? I'm maybe a bit confused, but I don't think the delay would help much with preventing the local APIC errors? Regardless of the wait, if the interrupts target offline CPUs there's a chance receive accept errors will be triggered on AMD. Thanks, Roger.
On 11.02.2025 09:50, Roger Pau Monné wrote: > On Tue, Feb 11, 2025 at 07:39:12AM +0100, Jan Beulich wrote: >> On 06.02.2025 16:06, Roger Pau Monne wrote: >>> The following series aims to prevent local APIC errors from stalling the >>> shtudown process. On XenServer testing we have seen reports of AMD >>> boxes sporadically getting stuck in a spam of: >>> >>> APIC error on CPU0: 00(08), Receive accept error >>> >>> Messages during shutdown, as a result of device interrupts targeting >>> CPUs that are offline (and have the local APIC disabled). >> >> One more thought here: Have you/we perhaps discovered the reason why there >> was that 1ms delay at the end of fixup_irqs() that was badly commented, >> and that you removed in e2bb28d62158 ("x86/irq: forward pending interrupts >> to new destination in fixup_irqs()")? May be worth mentioning that by way >> of a Fixes: tag. > > Hm, so you think the delay was added there as a way to ensure any > pending interrupts would get drained (ie: serviced) on the old target? So far I didn't have the slightest idea why that call had been there. This at least gives a possible reason. > I'm maybe a bit confused, but I don't think the delay would help much > with preventing the local APIC errors? Regardless of the wait, if the > interrupts target offline CPUs there's a chance receive accept errors > will be triggered on AMD. But fixup_irqs() right now runs ahead of actually offlining the APs. Jan
On 11.02.2025 09:38, Roger Pau Monné wrote: > On Mon, Feb 10, 2025 at 07:29:35PM +0100, Oleksii Kurochko wrote: >> On 2/10/25 11:02 AM, Roger Pau Monné wrote: >>> This should have had a 'for-4.20?' tag in the subject name, as >>> otherwise we will need to add an errata to the release notes to notice >>> that reboot can sometimes fail on AMD boxes. >>> >>> Also adding Oleksii. >>> >>> Thanks, Roger. >>> >>> On Thu, Feb 06, 2025 at 04:06:10PM +0100, Roger Pau Monne wrote: >>>> Hello, >>>> >>>> The following series aims to prevent local APIC errors from stalling the >>>> shtudown process. On XenServer testing we have seen reports of AMD >>>> boxes sporadically getting stuck in a spam of: >> >> How often this issue happen? > > Hard to tell, we have certainly hit it more than once on XenRT, but > I don't have figures about its probability. I have at least 14 > reports from XenRT from the last 6 months, but there's possibly a lot > more that could have been classified as a different kind of issue. Our QA tells me that this is severely hindering their work. Jan