mbox series

[v2,0/5] xen/x86: prevent local APIC errors at shutdown

Message ID 20250206150615.52052-1-roger.pau@citrix.com (mailing list archive)
Headers show
Series xen/x86: prevent local APIC errors at shutdown | expand

Message

Roger Pau Monné Feb. 6, 2025, 3:06 p.m. UTC
Hello,

The following series aims to prevent local APIC errors from stalling the
shtudown process.  On XenServer testing we have seen reports of AMD
boxes sporadically getting stuck in a spam of:

APIC error on CPU0: 00(08), Receive accept error

Messages during shutdown, as a result of device interrupts targeting
CPUs that are offline (and have the local APIC disabled).

First patch strictly solves the issue of shutdown getting stuck, further
patches aim to quiesce interrupts from all devices (known by Xen) as an
attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also
make kexec more reliable.

Thanks, Roger.

Roger Pau Monne (5):
  x86/shutdown: offline APs with interrupts disabled on all CPUs
  x86/irq: drop fixup_irqs() parameters
  x86/smp: perform disabling on interrupts ahead of AP shutdown
  x86/pci: disable MSI(-X) on all devices at shutdown
  x86/iommu: disable interrupts at shutdown

 xen/arch/x86/crash.c                        |  2 ++
 xen/arch/x86/include/asm/irq.h              |  4 +--
 xen/arch/x86/include/asm/msi.h              |  1 +
 xen/arch/x86/irq.c                          | 30 ++++++++-----------
 xen/arch/x86/msi.c                          | 18 +++++++++++
 xen/arch/x86/smp.c                          | 33 +++++++++++++++------
 xen/arch/x86/smpboot.c                      |  2 +-
 xen/drivers/passthrough/amd/iommu.h         |  1 +
 xen/drivers/passthrough/amd/iommu_init.c    | 17 +++++++++++
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  1 +
 xen/drivers/passthrough/iommu.c             |  6 ++++
 xen/drivers/passthrough/pci.c               | 33 +++++++++++++++++++++
 xen/drivers/passthrough/vtd/iommu.c         | 19 ++++++++++++
 xen/include/xen/iommu.h                     |  3 ++
 xen/include/xen/pci.h                       |  4 +++
 15 files changed, 145 insertions(+), 29 deletions(-)

Comments

Roger Pau Monné Feb. 10, 2025, 10:02 a.m. UTC | #1
Hello,

This should have had a 'for-4.20?' tag in the subject name, as
otherwise we will need to add an errata to the release notes to notice
that reboot can sometimes fail on AMD boxes.

Also adding Oleksii.

Thanks, Roger.

On Thu, Feb 06, 2025 at 04:06:10PM +0100, Roger Pau Monne wrote:
> Hello,
> 
> The following series aims to prevent local APIC errors from stalling the
> shtudown process.  On XenServer testing we have seen reports of AMD
> boxes sporadically getting stuck in a spam of:
> 
> APIC error on CPU0: 00(08), Receive accept error
> 
> Messages during shutdown, as a result of device interrupts targeting
> CPUs that are offline (and have the local APIC disabled).
> 
> First patch strictly solves the issue of shutdown getting stuck, further
> patches aim to quiesce interrupts from all devices (known by Xen) as an
> attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also
> make kexec more reliable.
> 
> Thanks, Roger.
> 
> Roger Pau Monne (5):
>   x86/shutdown: offline APs with interrupts disabled on all CPUs
>   x86/irq: drop fixup_irqs() parameters
>   x86/smp: perform disabling on interrupts ahead of AP shutdown
>   x86/pci: disable MSI(-X) on all devices at shutdown
>   x86/iommu: disable interrupts at shutdown
> 
>  xen/arch/x86/crash.c                        |  2 ++
>  xen/arch/x86/include/asm/irq.h              |  4 +--
>  xen/arch/x86/include/asm/msi.h              |  1 +
>  xen/arch/x86/irq.c                          | 30 ++++++++-----------
>  xen/arch/x86/msi.c                          | 18 +++++++++++
>  xen/arch/x86/smp.c                          | 33 +++++++++++++++------
>  xen/arch/x86/smpboot.c                      |  2 +-
>  xen/drivers/passthrough/amd/iommu.h         |  1 +
>  xen/drivers/passthrough/amd/iommu_init.c    | 17 +++++++++++
>  xen/drivers/passthrough/amd/pci_amd_iommu.c |  1 +
>  xen/drivers/passthrough/iommu.c             |  6 ++++
>  xen/drivers/passthrough/pci.c               | 33 +++++++++++++++++++++
>  xen/drivers/passthrough/vtd/iommu.c         | 19 ++++++++++++
>  xen/include/xen/iommu.h                     |  3 ++
>  xen/include/xen/pci.h                       |  4 +++
>  15 files changed, 145 insertions(+), 29 deletions(-)
> 
> -- 
> 2.46.0
>
Oleksii Kurochko Feb. 10, 2025, 6:29 p.m. UTC | #2
Hello Roger,

On 2/10/25 11:02 AM, Roger Pau Monné wrote:
> Hello,
>
> This should have had a 'for-4.20?' tag in the subject name, as
> otherwise we will need to add an errata to the release notes to notice
> that reboot can sometimes fail on AMD boxes.
>
> Also adding Oleksii.
>
> Thanks, Roger.
>
> On Thu, Feb 06, 2025 at 04:06:10PM +0100, Roger Pau Monne wrote:
>> Hello,
>>
>> The following series aims to prevent local APIC errors from stalling the
>> shtudown process.  On XenServer testing we have seen reports of AMD
>> boxes sporadically getting stuck in a spam of:

How often this issue happen?

>>
>> APIC error on CPU0: 00(08), Receive accept error
>>
>> Messages during shutdown, as a result of device interrupts targeting
>> CPUs that are offline (and have the local APIC disabled).
>>
>> First patch strictly solves the issue of shutdown getting stuck, further
>> patches aim to quiesce interrupts from all devices (known by Xen) as an
>> attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also
>> make kexec more reliable.

If the first patch solves does it make sense to consider, at least, it to be merged?

~ Oleksii

>>
>> Thanks, Roger.
>>
>> Roger Pau Monne (5):
>>    x86/shutdown: offline APs with interrupts disabled on all CPUs
>>    x86/irq: drop fixup_irqs() parameters
>>    x86/smp: perform disabling on interrupts ahead of AP shutdown
>>    x86/pci: disable MSI(-X) on all devices at shutdown
>>    x86/iommu: disable interrupts at shutdown
>>
>>   xen/arch/x86/crash.c                        |  2 ++
>>   xen/arch/x86/include/asm/irq.h              |  4 +--
>>   xen/arch/x86/include/asm/msi.h              |  1 +
>>   xen/arch/x86/irq.c                          | 30 ++++++++-----------
>>   xen/arch/x86/msi.c                          | 18 +++++++++++
>>   xen/arch/x86/smp.c                          | 33 +++++++++++++++------
>>   xen/arch/x86/smpboot.c                      |  2 +-
>>   xen/drivers/passthrough/amd/iommu.h         |  1 +
>>   xen/drivers/passthrough/amd/iommu_init.c    | 17 +++++++++++
>>   xen/drivers/passthrough/amd/pci_amd_iommu.c |  1 +
>>   xen/drivers/passthrough/iommu.c             |  6 ++++
>>   xen/drivers/passthrough/pci.c               | 33 +++++++++++++++++++++
>>   xen/drivers/passthrough/vtd/iommu.c         | 19 ++++++++++++
>>   xen/include/xen/iommu.h                     |  3 ++
>>   xen/include/xen/pci.h                       |  4 +++
>>   15 files changed, 145 insertions(+), 29 deletions(-)
>>
>> -- 
>> 2.46.0
>>
Jan Beulich Feb. 11, 2025, 6:39 a.m. UTC | #3
On 06.02.2025 16:06, Roger Pau Monne wrote:
> The following series aims to prevent local APIC errors from stalling the
> shtudown process.  On XenServer testing we have seen reports of AMD
> boxes sporadically getting stuck in a spam of:
> 
> APIC error on CPU0: 00(08), Receive accept error
> 
> Messages during shutdown, as a result of device interrupts targeting
> CPUs that are offline (and have the local APIC disabled).

One more thought here: Have you/we perhaps discovered the reason why there
was that 1ms delay at the end of fixup_irqs() that was badly commented,
and that you removed in e2bb28d62158 ("x86/irq: forward pending interrupts
to new destination in fixup_irqs()")? May be worth mentioning that by way
of a Fixes: tag.

Jan
Roger Pau Monné Feb. 11, 2025, 8:38 a.m. UTC | #4
On Mon, Feb 10, 2025 at 07:29:35PM +0100, Oleksii Kurochko wrote:
> Hello Roger,
> 
> On 2/10/25 11:02 AM, Roger Pau Monné wrote:
> > Hello,
> > 
> > This should have had a 'for-4.20?' tag in the subject name, as
> > otherwise we will need to add an errata to the release notes to notice
> > that reboot can sometimes fail on AMD boxes.
> > 
> > Also adding Oleksii.
> > 
> > Thanks, Roger.
> > 
> > On Thu, Feb 06, 2025 at 04:06:10PM +0100, Roger Pau Monne wrote:
> > > Hello,
> > > 
> > > The following series aims to prevent local APIC errors from stalling the
> > > shtudown process.  On XenServer testing we have seen reports of AMD
> > > boxes sporadically getting stuck in a spam of:
> 
> How often this issue happen?

Hard to tell, we have certainly hit it more than once on XenRT, but
I don't have figures about its probability.  I have at least 14
reports from XenRT from the last 6 months, but there's possibly a lot
more that could have been classified as a different kind of issue.

> > > 
> > > APIC error on CPU0: 00(08), Receive accept error
> > > 
> > > Messages during shutdown, as a result of device interrupts targeting
> > > CPUs that are offline (and have the local APIC disabled).
> > > 
> > > First patch strictly solves the issue of shutdown getting stuck, further
> > > patches aim to quiesce interrupts from all devices (known by Xen) as an
> > > attempt to prevent a spurious "APIC error on CPU0: 00(00)" plus also
> > > make kexec more reliable.
> 
> If the first patch solves does it make sense to consider, at least, it to be merged?

First one sure, the rest I think are also worth considering.  They get
rid of the resulting innocuous "APIC error on CPU0: 00(00)" message.
Also remaining patches are likely to provide the kexec kernel with a
better context, as they quiesce interrupts from devices.

I will send a new version soon, hopefully we can discuss over that one
which patches we want to pick.  With my XenServer hat on I plan to
backport the whole series into our patch queue.

Thanks, Roger.
Roger Pau Monné Feb. 11, 2025, 8:50 a.m. UTC | #5
On Tue, Feb 11, 2025 at 07:39:12AM +0100, Jan Beulich wrote:
> On 06.02.2025 16:06, Roger Pau Monne wrote:
> > The following series aims to prevent local APIC errors from stalling the
> > shtudown process.  On XenServer testing we have seen reports of AMD
> > boxes sporadically getting stuck in a spam of:
> > 
> > APIC error on CPU0: 00(08), Receive accept error
> > 
> > Messages during shutdown, as a result of device interrupts targeting
> > CPUs that are offline (and have the local APIC disabled).
> 
> One more thought here: Have you/we perhaps discovered the reason why there
> was that 1ms delay at the end of fixup_irqs() that was badly commented,
> and that you removed in e2bb28d62158 ("x86/irq: forward pending interrupts
> to new destination in fixup_irqs()")? May be worth mentioning that by way
> of a Fixes: tag.

Hm, so you think the delay was added there as a way to ensure any
pending interrupts would get drained (ie: serviced) on the old target?

I'm maybe a bit confused, but I don't think the delay would help much
with preventing the local APIC errors?  Regardless of the wait, if the
interrupts target offline CPUs there's a chance receive accept errors
will be triggered on AMD.

Thanks, Roger.
Jan Beulich Feb. 11, 2025, 9:20 a.m. UTC | #6
On 11.02.2025 09:50, Roger Pau Monné wrote:
> On Tue, Feb 11, 2025 at 07:39:12AM +0100, Jan Beulich wrote:
>> On 06.02.2025 16:06, Roger Pau Monne wrote:
>>> The following series aims to prevent local APIC errors from stalling the
>>> shtudown process.  On XenServer testing we have seen reports of AMD
>>> boxes sporadically getting stuck in a spam of:
>>>
>>> APIC error on CPU0: 00(08), Receive accept error
>>>
>>> Messages during shutdown, as a result of device interrupts targeting
>>> CPUs that are offline (and have the local APIC disabled).
>>
>> One more thought here: Have you/we perhaps discovered the reason why there
>> was that 1ms delay at the end of fixup_irqs() that was badly commented,
>> and that you removed in e2bb28d62158 ("x86/irq: forward pending interrupts
>> to new destination in fixup_irqs()")? May be worth mentioning that by way
>> of a Fixes: tag.
> 
> Hm, so you think the delay was added there as a way to ensure any
> pending interrupts would get drained (ie: serviced) on the old target?

So far I didn't have the slightest idea why that call had been there. This
at least gives a possible reason.

> I'm maybe a bit confused, but I don't think the delay would help much
> with preventing the local APIC errors?  Regardless of the wait, if the
> interrupts target offline CPUs there's a chance receive accept errors
> will be triggered on AMD.

But fixup_irqs() right now runs ahead of actually offlining the APs.

Jan
Jan Beulich Feb. 11, 2025, 9:26 a.m. UTC | #7
On 11.02.2025 09:38, Roger Pau Monné wrote:
> On Mon, Feb 10, 2025 at 07:29:35PM +0100, Oleksii Kurochko wrote:
>> On 2/10/25 11:02 AM, Roger Pau Monné wrote:
>>> This should have had a 'for-4.20?' tag in the subject name, as
>>> otherwise we will need to add an errata to the release notes to notice
>>> that reboot can sometimes fail on AMD boxes.
>>>
>>> Also adding Oleksii.
>>>
>>> Thanks, Roger.
>>>
>>> On Thu, Feb 06, 2025 at 04:06:10PM +0100, Roger Pau Monne wrote:
>>>> Hello,
>>>>
>>>> The following series aims to prevent local APIC errors from stalling the
>>>> shtudown process.  On XenServer testing we have seen reports of AMD
>>>> boxes sporadically getting stuck in a spam of:
>>
>> How often this issue happen?
> 
> Hard to tell, we have certainly hit it more than once on XenRT, but
> I don't have figures about its probability.  I have at least 14
> reports from XenRT from the last 6 months, but there's possibly a lot
> more that could have been classified as a different kind of issue.

Our QA tells me that this is severely hindering their work.

Jan