
[for-4.20,v3,0/5] xen/x86: prevent local APIC errors at shutdown

Message ID 20250211110209.86974-1-roger.pau@citrix.com (mailing list archive)

Message

Roger Pau Monné Feb. 11, 2025, 11:02 a.m. UTC
Hello,

The following series aims to prevent local APIC errors from stalling the
shutdown process.  On XenServer testing we have seen reports of AMD
boxes sporadically getting stuck in a spam of:

APIC error on CPU0: 00(08), Receive accept error

Messages during shutdown, as a result of device interrupts targeting
CPUs that are offline (and have the local APIC disabled).
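
For readers decoding the log line above, here is a minimal sketch (not part of
this series) that maps an ESR value to error names. It assumes the number in
parentheses is the local APIC Error Status Register content and uses the bit
names from the Intel SDM; AMD documents some of these bits as reserved. The
helpers esr_bit_name() and esr_describe() are hypothetical and exist only for
this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local APIC ESR bit names, bit 0 first (naming follows the Intel SDM). */
static const char *const esr_bit_names[8] = {
    "Send checksum error",
    "Receive checksum error",
    "Send accept error",
    "Receive accept error",
    "Redirectable IPI",
    "Send illegal vector",
    "Received illegal vector",
    "Illegal register address",
};

/* Hypothetical helper: name of a single ESR bit (0-7). */
static const char *esr_bit_name(unsigned int bit)
{
    assert(bit < 8);
    return esr_bit_names[bit];
}

/*
 * Hypothetical helper: write a comma-separated list of the errors flagged
 * in 'esr' into 'buf' (truncating if 'len' is too small) and return the
 * number of bits set.
 */
static unsigned int esr_describe(uint8_t esr, char *buf, size_t len)
{
    unsigned int bit, count = 0;
    size_t used = 0;

    if (len)
        buf[0] = '\0';

    for (bit = 0; bit < 8; bit++) {
        if (!(esr & (1u << bit)))
            continue;

        if (used < len) {
            int n = snprintf(buf + used, len - used, "%s%s",
                             count ? ", " : "", esr_bit_name(bit));

            if (n > 0)
                used += (size_t)n; /* may pass len once output is truncated */
        }
        count++;
    }

    return count;
}
```

With this mapping, 0x08 has only bit 3 set, which matches the "Receive accept
error" text in the log, while 00(00) flags no error at all.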

The first patch strictly solves the issue of shutdown getting stuck; the
remaining patches aim to quiesce interrupts from all devices (known to Xen)
in an attempt to prevent a spurious "APIC error on CPU0: 00(00)" and also
to make kexec more reliable.
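
As a rough illustration of what quiescing MSI(-X) on a device involves, the
sketch below walks a PCI capability list and clears the MSI Enable bit (bit 0
of Message Control) and the MSI-X Enable bit (bit 15 of Message Control). It
is not the code from the patches: it operates on an in-memory copy of config
space, and struct fake_pci_dev, cfg_read16(), cfg_write16(), find_cap() and
disable_msi_and_msix() are hypothetical names used only here. The register
offsets and bit positions are the ones defined by the PCI specification.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_STATUS             0x06
#define PCI_STATUS_CAP_LIST    0x10    /* device implements a capability list */
#define PCI_CAPABILITY_LIST    0x34    /* offset of the first capability */
#define PCI_CAP_ID_MSI         0x05
#define PCI_CAP_ID_MSIX        0x11
#define PCI_MSG_CTRL           2       /* Message Control offset within the cap */
#define PCI_MSI_FLAGS_ENABLE   0x0001
#define PCI_MSIX_FLAGS_ENABLE  0x8000

/* Hypothetical stand-in for a device: a copy of its 256-byte config space. */
struct fake_pci_dev {
    uint8_t cfg[256];
};

static uint16_t cfg_read16(const struct fake_pci_dev *d, unsigned int off)
{
    return (uint16_t)(d->cfg[off] | (d->cfg[off + 1] << 8));
}

static void cfg_write16(struct fake_pci_dev *d, unsigned int off, uint16_t val)
{
    d->cfg[off] = (uint8_t)(val & 0xff);
    d->cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Return the offset of capability 'id', or 0 if the device lacks it. */
static unsigned int find_cap(const struct fake_pci_dev *d, uint8_t id)
{
    unsigned int pos, ttl = 48; /* bound the walk in case of a looping list */

    if (!(cfg_read16(d, PCI_STATUS) & PCI_STATUS_CAP_LIST))
        return 0;

    pos = d->cfg[PCI_CAPABILITY_LIST] & ~3u;
    while (pos >= 0x40 && ttl--) {
        if (d->cfg[pos] == id)
            return pos;
        pos = d->cfg[pos + 1] & ~3u; /* next-capability pointer */
    }

    return 0;
}

/* Clear the MSI and MSI-X enable bits, leaving all other control bits alone. */
static void disable_msi_and_msix(struct fake_pci_dev *d)
{
    unsigned int pos = find_cap(d, PCI_CAP_ID_MSI);

    if (pos)
        cfg_write16(d, pos + PCI_MSG_CTRL,
                    (uint16_t)(cfg_read16(d, pos + PCI_MSG_CTRL) &
                               ~PCI_MSI_FLAGS_ENABLE));

    pos = find_cap(d, PCI_CAP_ID_MSIX);
    if (pos)
        cfg_write16(d, pos + PCI_MSG_CTRL,
                    (uint16_t)(cfg_read16(d, pos + PCI_MSG_CTRL) &
                               ~PCI_MSIX_FLAGS_ENABLE));
}
```

A real implementation would go through the hypervisor's config space accessors
and iterate over every known device; the sketch only shows the per-device bit
manipulation.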

Thanks, Roger.

Roger Pau Monne (5):
  x86/shutdown: offline APs with interrupts disabled on all CPUs
  x86/irq: drop fixup_irqs() parameters
  x86/smp: perform disabling on interrupts ahead of AP shutdown
  x86/pci: disable MSI(-X) on all devices at shutdown
  x86/iommu: disable interrupts at shutdown

 xen/arch/x86/crash.c                        |  8 +++++
 xen/arch/x86/include/asm/irq.h              |  4 +--
 xen/arch/x86/include/asm/msi.h              |  1 +
 xen/arch/x86/irq.c                          | 30 +++++++---------
 xen/arch/x86/msi.c                          | 18 ++++++++++
 xen/arch/x86/smp.c                          | 33 ++++++++++++-----
 xen/arch/x86/smpboot.c                      |  2 +-
 xen/drivers/passthrough/amd/iommu.h         |  1 +
 xen/drivers/passthrough/amd/iommu_init.c    | 17 +++++++++
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  1 +
 xen/drivers/passthrough/iommu.c             | 12 +++++++
 xen/drivers/passthrough/pci.c               | 39 +++++++++++++++++++++
 xen/drivers/passthrough/vtd/iommu.c         | 19 ++++++++++
 xen/include/xen/iommu.h                     |  3 ++
 xen/include/xen/pci.h                       |  7 ++++
 15 files changed, 166 insertions(+), 29 deletions(-)

Comments

Roger Pau Monné Feb. 11, 2025, 6:39 p.m. UTC | #1
On Tue, Feb 11, 2025 at 12:02:04PM +0100, Roger Pau Monne wrote:
> Hello,
> 
> The following series aims to prevent local APIC errors from stalling the
> shutdown process.  On XenServer testing we have seen reports of AMD
> boxes sporadically getting stuck in a spam of:
> 
> APIC error on CPU0: 00(08), Receive accept error
> 
> Messages during shutdown, as a result of device interrupts targeting
> CPUs that are offline (and have the local APIC disabled).
> 
> First patch strictly solves the issue of shutdown getting stuck, further
> patches aim to quiesce interrupts from all devices (known by Xen) as an
> attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also
> make kexec more reliable.
> 
> Thanks, Roger.
> 
> Roger Pau Monne (5):
>   x86/shutdown: offline APs with interrupts disabled on all CPUs
>   x86/irq: drop fixup_irqs() parameters
>   x86/smp: perform disabling on interrupts ahead of AP shutdown
>   x86/pci: disable MSI(-X) on all devices at shutdown
>   x86/iommu: disable interrupts at shutdown

This is now fully reviewed, can I get your opinion (and
release-acked-by) on which patches we should take for 4.20?

Thanks, Roger.
Oleksii Kurochko Feb. 12, 2025, 8:33 a.m. UTC | #2
On 2/11/25 7:39 PM, Roger Pau Monné wrote:
> On Tue, Feb 11, 2025 at 12:02:04PM +0100, Roger Pau Monne wrote:
>> Hello,
>>
>> The following series aims to prevent local APIC errors from stalling the
>> shutdown process.  On XenServer testing we have seen reports of AMD
>> boxes sporadically getting stuck in a spam of:
>>
>> APIC error on CPU0: 00(08), Receive accept error
>>
>> Messages during shutdown, as a result of device interrupts targeting
>> CPUs that are offline (and have the local APIC disabled).
>>
>> First patch strictly solves the issue of shutdown getting stuck, further
>> patches aim to quiesce interrupts from all devices (known by Xen) as an
>> attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also
>> make kexec more reliable.
>>
>> Thanks, Roger.
>>
>> Roger Pau Monne (5):
>>    x86/shutdown: offline APs with interrupts disabled on all CPUs
>>    x86/irq: drop fixup_irqs() parameters
>>    x86/smp: perform disabling on interrupts ahead of AP shutdown
>>    x86/pci: disable MSI(-X) on all devices at shutdown
>>    x86/iommu: disable interrupts at shutdown
> This is now fully reviewed, can I get your opinion (and
> release-acked-by) on which patches we should take for 4.20?

If my understanding is correct, merging just the first patch is enough to
unblock the shutdown process, correct? So the first patch should be merged.

As the second patch has no functional changes, IMO it could be merged too,
despite the fact that we are in the hard code freeze period.

For all the other patches I would like to ask for additional opinions (as I am
not an expert in x86). At first glance it looks like the absence of these
patches from the staging branch will only lead to triggering "Receive accept
error", which I believe won't block the shutdown process, so these patches
could be postponed until 4.21. On the other hand, if they are low-risk fixes
then we could consider merging them now.

~ Oleksii
Jan Beulich Feb. 12, 2025, 8:51 a.m. UTC | #3
On 12.02.2025 09:33, Oleksii Kurochko wrote:
> 
> On 2/11/25 7:39 PM, Roger Pau Monné wrote:
>> On Tue, Feb 11, 2025 at 12:02:04PM +0100, Roger Pau Monne wrote:
>>> Hello,
>>>
>>> The following series aims to prevent local APIC errors from stalling the
>>> shutdown process.  On XenServer testing we have seen reports of AMD
>>> boxes sporadically getting stuck in a spam of:
>>>
>>> APIC error on CPU0: 00(08), Receive accept error
>>>
>>> Messages during shutdown, as a result of device interrupts targeting
>>> CPUs that are offline (and have the local APIC disabled).
>>>
>>> First patch strictly solves the issue of shutdown getting stuck, further
>>> patches aim to quiesce interrupts from all devices (known by Xen) as an
>>> attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also
>>> make kexec more reliable.
>>>
>>> Thanks, Roger.
>>>
>>> Roger Pau Monne (5):
>>>    x86/shutdown: offline APs with interrupts disabled on all CPUs
>>>    x86/irq: drop fixup_irqs() parameters
>>>    x86/smp: perform disabling on interrupts ahead of AP shutdown
>>>    x86/pci: disable MSI(-X) on all devices at shutdown
>>>    x86/iommu: disable interrupts at shutdown
>> This is now fully reviewed, can I get your opinion (and
>> release-acked-by) on which patches we should take for 4.20?
> 
> If my understanding is correct, merging just the first patch is enough to
> unblock the shutdown process, correct? So the first patch should be merged.
> 
> As the second patch has no functional changes, IMO it could be merged too,
> despite the fact that we are in the hard code freeze period.
> 
> For all the other patches I would like to ask for additional opinions (as I am
> not an expert in x86). At first glance it looks like the absence of these
> patches from the staging branch will only lead to triggering "Receive accept
> error", which I believe won't block the shutdown process, so these patches
> could be postponed until 4.21. On the other hand, if they are low-risk fixes
> then we could consider merging them now.

I'm not Roger, but as a data point: While I'm uncertain about patch 2, all
others in this series will very likely be backported anyway.

Jan
Roger Pau Monné Feb. 12, 2025, 9:25 a.m. UTC | #4
On Wed, Feb 12, 2025 at 09:51:16AM +0100, Jan Beulich wrote:
> On 12.02.2025 09:33, Oleksii Kurochko wrote:
> > 
> > On 2/11/25 7:39 PM, Roger Pau Monné wrote:
> >> On Tue, Feb 11, 2025 at 12:02:04PM +0100, Roger Pau Monne wrote:
> >>> Hello,
> >>>
> >>> The following series aims to prevent local APIC errors from stalling the
> >>> shutdown process.  On XenServer testing we have seen reports of AMD
> >>> boxes sporadically getting stuck in a spam of:
> >>>
> >>> APIC error on CPU0: 00(08), Receive accept error
> >>>
> >>> Messages during shutdown, as a result of device interrupts targeting
> >>> CPUs that are offline (and have the local APIC disabled).
> >>>
> >>> First patch strictly solves the issue of shutdown getting stuck, further
> >>> patches aim to quiesce interrupts from all devices (known by Xen) as an
> >>> attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also
> >>> make kexec more reliable.
> >>>
> >>> Thanks, Roger.
> >>>
> >>> Roger Pau Monne (5):
> >>>    x86/shutdown: offline APs with interrupts disabled on all CPUs
> >>>    x86/irq: drop fixup_irqs() parameters
> >>>    x86/smp: perform disabling on interrupts ahead of AP shutdown
> >>>    x86/pci: disable MSI(-X) on all devices at shutdown
> >>>    x86/iommu: disable interrupts at shutdown
> >> This is now fully reviewed, can I get your opinion (and
> >> release-acked-by) on which patches we should take for 4.20?
> > 
> > If my understanding is correct, merging just the first patch is enough to
> > unblock the shutdown process, correct? So the first patch should be merged.
> > 
> > As the second patch has no functional changes, IMO it could be merged too,
> > despite the fact that we are in the hard code freeze period.
> > 
> > For all the other patches I would like to ask for additional opinions (as I am
> > not an expert in x86). At first glance it looks like the absence of these
> > patches from the staging branch will only lead to triggering "Receive accept
> > error", which I believe won't block the shutdown process, so these patches
> > could be postponed until 4.21. On the other hand, if they are low-risk fixes
> > then we could consider merging them now.

I expect the following patches might make kexec'ing from Xen a bit
more reliable, as the kexec'ed kernel should find an environment with
interrupts from all Xen known devices quiesced.

> I'm not Roger, but as a data point: While I'm uncertain about patch 2, all
> others in this series will very likely be backported anyway.

I plan to backport the series to the XenServer patch queue also when it
goes in.

Thanks, Roger.
Oleksii Kurochko Feb. 12, 2025, 2:25 p.m. UTC | #5
On 2/12/25 10:25 AM, Roger Pau Monné wrote:
> On Wed, Feb 12, 2025 at 09:51:16AM +0100, Jan Beulich wrote:
>> On 12.02.2025 09:33, Oleksii Kurochko wrote:
>>> On 2/11/25 7:39 PM, Roger Pau Monné wrote:
>>>> On Tue, Feb 11, 2025 at 12:02:04PM +0100, Roger Pau Monne wrote:
>>>>> Hello,
>>>>>
>>>>> The following series aims to prevent local APIC errors from stalling the
>>>>> shutdown process.  On XenServer testing we have seen reports of AMD
>>>>> boxes sporadically getting stuck in a spam of:
>>>>>
>>>>> APIC error on CPU0: 00(08), Receive accept error
>>>>>
>>>>> Messages during shutdown, as a result of device interrupts targeting
>>>>> CPUs that are offline (and have the local APIC disabled).
>>>>>
>>>>> First patch strictly solves the issue of shutdown getting stuck, further
>>>>> patches aim to quiesce interrupts from all devices (known by Xen) as an
>>>>> attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also
>>>>> make kexec more reliable.
>>>>>
>>>>> Thanks, Roger.
>>>>>
>>>>> Roger Pau Monne (5):
>>>>>     x86/shutdown: offline APs with interrupts disabled on all CPUs
>>>>>     x86/irq: drop fixup_irqs() parameters
>>>>>     x86/smp: perform disabling on interrupts ahead of AP shutdown
>>>>>     x86/pci: disable MSI(-X) on all devices at shutdown
>>>>>     x86/iommu: disable interrupts at shutdown
>>>> This is now fully reviewed, can I get your opinion (and
>>>> release-acked-by) on which patches we should take for 4.20?
>>> If my understanding is correct, merging just the first patch is enough to
>>> unblock the shutdown process, correct? So the first patch should be merged.
>>>
>>> As the second patch has no functional changes, IMO it could be merged too,
>>> despite the fact that we are in the hard code freeze period.
>>>
>>> For all the other patches I would like to ask for additional opinions (as I am
>>> not an expert in x86). At first glance it looks like the absence of these
>>> patches from the staging branch will only lead to triggering "Receive accept
>>> error", which I believe won't block the shutdown process, so these patches
>>> could be postponed until 4.21. On the other hand, if they are low-risk fixes
>>> then we could consider merging them now.
> I expect the following patches might make kexec'ing from Xen a bit
> more reliable, as the kexec'ed kernel should find an environment with
> interrupts from all Xen known devices quiesced.
>
>> I'm not Roger, but as a data point: While I'm uncertain about patch 2, all
>> others in this series will very likely be backported anyway.
> I plan to backport the series to the XenServer patch queue also when it
> goes in.

If it is likely to be backported anyway, then let's merge the patch series now:
   Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>

~Oleksii