Message ID | 20250211110209.86974-1-roger.pau@citrix.com (mailing list archive) |
---|---|
Series | xen/x86: prevent local APIC errors at shutdown |
On Tue, Feb 11, 2025 at 12:02:04PM +0100, Roger Pau Monne wrote:
> Hello,
>
> The following series aims to prevent local APIC errors from stalling the
> shutdown process. On XenServer testing we have seen reports of AMD
> boxes sporadically getting stuck in a spam of:
>
> APIC error on CPU0: 00(08), Receive accept error
>
> Messages during shutdown, as a result of device interrupts targeting
> CPUs that are offline (and have the local APIC disabled).
>
> First patch strictly solves the issue of shutdown getting stuck, further
> patches aim to quiesce interrupts from all devices (known by Xen) as an
> attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also
> make kexec more reliable.
>
> Thanks, Roger.
>
> Roger Pau Monne (5):
>   x86/shutdown: offline APs with interrupts disabled on all CPUs
>   x86/irq: drop fixup_irqs() parameters
>   x86/smp: perform disabling on interrupts ahead of AP shutdown
>   x86/pci: disable MSI(-X) on all devices at shutdown
>   x86/iommu: disable interrupts at shutdown

This is now fully reviewed, can I get your opinion (and
release-acked-by) on which patches we should take for 4.20?

Thanks, Roger.
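For readers unfamiliar with the error line quoted in the cover letter, the sketch below decodes a local APIC Error Status Register (ESR) value into its bit names. It assumes the parenthesised hex value in the log line is an ESR read, which is consistent with `08` being reported as "Receive accept error" (bit 3) and with `00` carrying no error bits. The helper name `decode_esr` is hypothetical and this is not Xen's own code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Names of the architectural ESR bits 0..7. */
const char *const esr_bit_names[8] = {
    "Send checksum error",      /* bit 0 */
    "Receive checksum error",   /* bit 1 */
    "Send accept error",        /* bit 2 */
    "Receive accept error",     /* bit 3 */
    "Redirectable IPI",         /* bit 4 */
    "Send illegal vector",      /* bit 5 */
    "Received illegal vector",  /* bit 6 */
    "Illegal register address", /* bit 7 */
};

/*
 * Hypothetical helper: write a comma-separated list of the error names
 * set in 'esr' into 'buf'. Returns the number of names written; stops
 * early if the buffer is too small.
 */
int decode_esr(uint8_t esr, char *buf, size_t len)
{
    int count = 0;
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (int bit = 0; bit < 8; bit++) {
        int n;

        if (!(esr & (1u << bit)))
            continue;

        n = snprintf(buf + used, len - used, "%s%s",
                     count ? ", " : "", esr_bit_names[bit]);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0'; /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        count++;
    }

    return count;
}
```

Calling `decode_esr(0x08, buf, sizeof(buf))` fills `buf` with the "Receive accept error" name from the report, while a value of `0x00` yields an empty string, matching the spurious `00(00)` case.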
On 2/11/25 7:39 PM, Roger Pau Monné wrote:
> This is now fully reviewed, can I get your opinion (and
> release-acked-by) on which patches we should take for 4.20?

If my understanding is correct, merging only the first patch is enough to
unblock the shutdown process, correct? So the first patch should be merged.

As the second patch has no functional changes, IMO it could be merged too,
despite the fact that we are in the hard code freeze period.

For all the other patches I would like to ask for an additional opinion (as
I am not an expert in x86). At first glance it looks like the absence of these
patches in the staging branch will only lead to triggering "Receive accept
error", which I believe won't block the shutdown process, so these patches
could be postponed until 4.21. On the other hand, if they are low-risk fixes
then we could consider merging them now.

~ Oleksii
On 12.02.2025 09:33, Oleksii Kurochko wrote:
> For all the other patches I would like to ask for an additional opinion (as
> I am not an expert in x86). At first glance it looks like the absence of these
> patches in the staging branch will only lead to triggering "Receive accept
> error", which I believe won't block the shutdown process, so these patches
> could be postponed until 4.21. On the other hand, if they are low-risk fixes
> then we could consider merging them now.

I'm not Roger, but as a data point: While I'm uncertain about patch 2, all
others in this series will very likely be backported anyway.

Jan
On Wed, Feb 12, 2025 at 09:51:16AM +0100, Jan Beulich wrote:
> On 12.02.2025 09:33, Oleksii Kurochko wrote:
> > For all the other patches I would like to ask for an additional opinion (as
> > I am not an expert in x86). At first glance it looks like the absence of these
> > patches in the staging branch will only lead to triggering "Receive accept
> > error", which I believe won't block the shutdown process, so these patches
> > could be postponed until 4.21. On the other hand, if they are low-risk fixes
> > then we could consider merging them now.

I expect the following patches might make kexec'ing from Xen a bit
more reliable, as the kexec'ed kernel should find an environment with
interrupts from all Xen known devices quiesced.

> I'm not Roger, but as a data point: While I'm uncertain about patch 2, all
> others in this series will very likely be backported anyway.

I plan to backport the series to the XenServer patch queue also when it
goes in.

Thanks, Roger.
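As an illustration of what quiescing device interrupts can involve at the PCI level, the sketch below walks a device's capability list and clears the MSI and MSI-X enable bits. It is a minimal sketch operating on an in-memory copy of the first 256 bytes of configuration space, not the code from the series; the function name `disable_msi_and_msix` and the array-based accessors are assumptions made for the example, and real code would use the platform's configuration space accessors instead.

```c
#include <assert.h>
#include <stdint.h>

/* Standard PCI configuration space offsets and capability IDs. */
#define PCI_STATUS            0x06
#define PCI_STATUS_CAP_LIST   0x10
#define PCI_CAPABILITY_LIST   0x34
#define PCI_CAP_ID_MSI        0x05
#define PCI_CAP_ID_MSIX       0x11
#define PCI_MSI_FLAGS_ENABLE  0x0001 /* Message Control bit 0 */
#define PCI_MSIX_FLAGS_ENABLE 0x8000 /* Message Control bit 15 */

/* Little-endian 16-bit accessors on the in-memory config space copy. */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned int off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static void cfg_write16(uint8_t *cfg, unsigned int off, uint16_t val)
{
    cfg[off] = (uint8_t)(val & 0xff);
    cfg[off + 1] = (uint8_t)(val >> 8);
}

/*
 * Hypothetical helper: clear the MSI and MSI-X enable bits found in the
 * capability list. Returns how many enable bits were actually cleared.
 */
int disable_msi_and_msix(uint8_t cfg[256])
{
    unsigned int pos, ttl = 48; /* bound the walk against list loops */
    int cleared = 0;

    assert(cfg != 0);

    if (!(cfg_read16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST))
        return 0;

    pos = cfg[PCI_CAPABILITY_LIST] & 0xfc;
    while (pos >= 0x40 && ttl--) {
        uint8_t id = cfg[pos];
        uint16_t mask = 0;

        if (id == PCI_CAP_ID_MSI)
            mask = PCI_MSI_FLAGS_ENABLE;
        else if (id == PCI_CAP_ID_MSIX)
            mask = PCI_MSIX_FLAGS_ENABLE;

        if (mask) {
            /* Message Control lives at offset 2 of both capabilities. */
            uint16_t ctrl = cfg_read16(cfg, pos + 2);

            if (ctrl & mask) {
                cfg_write16(cfg, pos + 2, (uint16_t)(ctrl & ~mask));
                cleared++;
            }
        }

        pos = cfg[pos + 1] & 0xfc; /* next capability pointer */
    }

    return cleared;
}
```

With both enable bits cleared a device should no longer signal message interrupts, which is the state a kexec'ed kernel would ideally find.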
On 2/12/25 10:25 AM, Roger Pau Monné wrote:
> On Wed, Feb 12, 2025 at 09:51:16AM +0100, Jan Beulich wrote:
>> I'm not Roger, but as a data point: While I'm uncertain about patch 2, all
>> others in this series will very likely be backported anyway.
> I plan to backport the series to the XenServer patch queue also when it
> goes in.

If it is likely to be backported anyway, then let's merge the patch series now:

Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>

~Oleksii