[1/2] PCI/AER: Disable AER interrupt during suspend

Message ID 20210127173101.446940-1-kai.heng.feng@canonical.com (mailing list archive)
State Not Applicable
Delegated to: Bjorn Helgaas
Series: [1/2] PCI/AER: Disable AER interrupt during suspend

Commit Message

Kai-Heng Feng Jan. 27, 2021, 5:31 p.m. UTC
Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
hint") enables ACS, and some platforms lose their NVMe devices after
resuming from firmware suspend (S3):
[   50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
[   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
[   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error status/mask=00200000/00010000
[   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
[   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
[   50.947843] nvme nvme0: frozen state error detected, reset controller

It happens right after ACS gets enabled during resume.

To prevent that from happening, disable the AER interrupt on system
suspend and re-enable it on resume.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/pci/pcie/aer.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
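
For context, the aer_disable_rootport()/aer_enable_rootport() helpers that
the new hooks call already exist in aer.c. Disabling boils down to masking
the Root Port's error-message interrupt and clearing the Root Error Status
register, roughly along these lines (paraphrased sketch, not a verbatim
copy of the upstream code):

static void aer_disable_rootport(struct aer_rpc *rpc)
{
	struct pci_dev *pdev = rpc->rpd;
	int aer = pdev->aer_cap;
	u32 reg32;

	/* Disable the Root Port's interrupt in response to error messages */
	pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, &reg32);
	reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
	pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, reg32);

	/* Clear the Root Port's error status register */
	pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_STATUS, &reg32);
	pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_STATUS, reg32);
}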

Comments

Bjorn Helgaas Jan. 27, 2021, 8:50 p.m. UTC | #1
On Thu, Jan 28, 2021 at 01:31:00AM +0800, Kai-Heng Feng wrote:
> Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
> hint") enables ACS, and some platforms lose its NVMe after resume from
> firmware:
> [   50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
> [   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> [   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> [   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error status/mask=00200000/00010000
> [   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
> [   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> [   50.947843] nvme nvme0: frozen state error detected, reset controller
> 
> It happens right after ACS gets enabled during resume.
> 
> To prevent that from happening, disable AER interrupt and enable it on
> system suspend and resume, respectively.

Lots of questions here.  Maybe this is what we'll end up doing, but I
am curious about why the error is reported in the first place.

Is this a consequence of the link going down and back up?

Is it a consequence of the device doing a DMA when it shouldn't?

Are we doing something in the wrong order during suspend?  Or maybe
resume, since I assume the error is reported during resume?

If we *do* take the error, why doesn't DPC recovery work?

> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
> Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/pci/pcie/aer.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 77b0f2c45bc0..0e9a85530ae6 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1365,6 +1365,22 @@ static int aer_probe(struct pcie_device *dev)
>  	return 0;
>  }
>  
> +static int aer_suspend(struct pcie_device *dev)
> +{
> +	struct aer_rpc *rpc = get_service_data(dev);
> +
> +	aer_disable_rootport(rpc);
> +	return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> +	struct aer_rpc *rpc = get_service_data(dev);
> +
> +	aer_enable_rootport(rpc);
> +	return 0;
> +}
> +
>  /**
>   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
>   * @dev: pointer to Root Port, RCEC, or RCiEP
> @@ -1437,6 +1453,8 @@ static struct pcie_port_service_driver aerdriver = {
>  	.service	= PCIE_PORT_SERVICE_AER,
>  
>  	.probe		= aer_probe,
> +	.suspend	= aer_suspend,
> +	.resume		= aer_resume,
>  	.remove		= aer_remove,
>  };
>  
> -- 
> 2.29.2
>
Kai-Heng Feng Jan. 28, 2021, 4:09 a.m. UTC | #2
On Thu, Jan 28, 2021 at 4:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Jan 28, 2021 at 01:31:00AM +0800, Kai-Heng Feng wrote:
> > Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
> > hint") enables ACS, and some platforms lose its NVMe after resume from
> > firmware:
> > [   50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
> > [   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> > [   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> > [   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error status/mask=00200000/00010000
> > [   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
> > [   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> > [   50.947843] nvme nvme0: frozen state error detected, reset controller
> >
> > It happens right after ACS gets enabled during resume.
> >
> > To prevent that from happening, disable AER interrupt and enable it on
> > system suspend and resume, respectively.
>
> Lots of questions here.  Maybe this is what we'll end up doing, but I
> am curious about why the error is reported in the first place.
>
> Is this a consequence of the link going down and back up?

Could be. From my observations, it only happens when firmware suspend
(S3) is used.
Maybe it happens when the device gets powered up, but I don't have the
equipment to debug at the hardware level.

If we use a non-firmware suspend method, enabling ACS after resume won't
trip AER and DPC.

>
> Is it consequence of the device doing a DMA when it shouldn't?

If it were doing DMA while suspending, the same error should also happen
after the NVMe device is suspended and before the PCIe port suspends.
Furthermore, if a non-firmware suspend method is used, there's no such
issue, so a stray DMA operation seems less likely.

>
> Are we doing something in the wrong order during suspend?  Or maybe
> resume, since I assume the error is reported during resume?

Yes, the error is reported during resume. The suspend/resume order
seems fine, as non-firmware suspend doesn't have this issue.

>
> If we *do* take the error, why doesn't DPC recovery work?

It works for the root port, but not for the NVMe drive:
[   50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
[   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
[   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error status/mask=00200000/00010000
[   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
[   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
[   50.947843] nvme nvme0: frozen state error detected, reset controller
[   50.948400] ACPI: EC: event unblocked
[   50.948432] xhci_hcd 0000:00:14.0: PME# disabled
[   50.948444] xhci_hcd 0000:00:14.0: enabling bus mastering
[   50.949056] pcieport 0000:00:1b.0: PME# disabled
[   50.949068] pcieport 0000:00:1c.0: PME# disabled
[   50.949416] e1000e 0000:00:1f.6: PME# disabled
[   50.949463] e1000e 0000:00:1f.6: enabling bus mastering
[   50.951606] sd 0:0:0:0: [sda] Starting disk
[   50.951610] nvme 0000:01:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[   50.951730] nvme nvme0: Removing after probe failure status: -19
[   50.952360] nvme nvme0: failed to set APST feature (-19)
[   50.971136] snd_hda_intel 0000:00:1f.3: PME# disabled
[   51.089330] pcieport 0000:00:1b.0: AER: broadcast resume message
[   51.089345] pcieport 0000:00:1b.0: AER: device recovery successful

But I think why recovery doesn't work for NVMe is for another discussion...

Kai-Heng

>
> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
> > Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > ---
> >  drivers/pci/pcie/aer.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 77b0f2c45bc0..0e9a85530ae6 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -1365,6 +1365,22 @@ static int aer_probe(struct pcie_device *dev)
> >       return 0;
> >  }
> >
> > +static int aer_suspend(struct pcie_device *dev)
> > +{
> > +     struct aer_rpc *rpc = get_service_data(dev);
> > +
> > +     aer_disable_rootport(rpc);
> > +     return 0;
> > +}
> > +
> > +static int aer_resume(struct pcie_device *dev)
> > +{
> > +     struct aer_rpc *rpc = get_service_data(dev);
> > +
> > +     aer_enable_rootport(rpc);
> > +     return 0;
> > +}
> > +
> >  /**
> >   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
> >   * @dev: pointer to Root Port, RCEC, or RCiEP
> > @@ -1437,6 +1453,8 @@ static struct pcie_port_service_driver aerdriver = {
> >       .service        = PCIE_PORT_SERVICE_AER,
> >
> >       .probe          = aer_probe,
> > +     .suspend        = aer_suspend,
> > +     .resume         = aer_resume,
> >       .remove         = aer_remove,
> >  };
> >
> > --
> > 2.29.2
> >
Bjorn Helgaas Feb. 4, 2021, 11:27 p.m. UTC | #3
[+cc Alex]

On Thu, Jan 28, 2021 at 12:09:37PM +0800, Kai-Heng Feng wrote:
> On Thu, Jan 28, 2021 at 4:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Jan 28, 2021 at 01:31:00AM +0800, Kai-Heng Feng wrote:
> > > Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
> > > hint") enables ACS, and some platforms lose its NVMe after resume from
> > > firmware:
> > > [   50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
> > > [   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> > > [   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> > > [   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error status/mask=00200000/00010000
> > > [   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
> > > [   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> > > [   50.947843] nvme nvme0: frozen state error detected, reset controller
> > >
> > > It happens right after ACS gets enabled during resume.
> > >
> > > To prevent that from happening, disable AER interrupt and enable it on
> > > system suspend and resume, respectively.
> >
> > Lots of questions here.  Maybe this is what we'll end up doing, but I
> > am curious about why the error is reported in the first place.
> >
> > Is this a consequence of the link going down and back up?
> 
> Could be. From the observations, it only happens when firmware suspend
> (S3) is used.
> Maybe it happens when it's gets powered up, but I don't have equipment
> to debug at hardware level.
> 
> If we use non-firmware suspend method, enabling ACS after resume won't
> trip AER and DPC.
> 
> > Is it consequence of the device doing a DMA when it shouldn't?
> 
> If it's doing DMA while suspending, the same error should also happen
> after NVMe is suspended and before PCIe port suspending.
> Furthermore, if non-firmware suspend method is used, there's so such
> issue, so less likely to be any DMA operation.
> 
> > Are we doing something in the wrong order during suspend?  Or maybe
> > resume, since I assume the error is reported during resume?
> 
> Yes the error is reported during resume. The suspend/resume order
> seems fine as non-firmware suspend doesn't have this issue.

I really feel like we need a better understanding of what's going on
here.  Disabling the AER interrupt is like closing our eyes and
pretending that because we don't see it, it didn't happen.

An ACS error is triggered by a DMA, right?  I'm assuming an MMIO
access from the CPU wouldn't trigger this error.  And it sounds like
the error is triggered before we even start running the driver after
resume.

If we're powering up an NVMe device from D3cold and it DMAs before the
driver touches it, something would be seriously broken.  I doubt
that's what's happening.  Maybe a device could resume some previously
programmed DMA after powering up from D3hot.

Or maybe the error occurred on suspend, like if the device wasn't
quiesced or something, but we didn't notice it until resume?  The 
AER error status bits are RW1CS, which means they can be preserved
across hot/warm/cold resets.

Can you instrument the code to see whether the AER error status bit is
set before enabling ACS?  I'm not sure that merely enabling ACS (I
assume you mean pci_std_enable_acs(), where we write PCI_ACS_CTRL)
should cause an interrupt for a previously-logged error.  I suspect
that could happen when enabling *AER*, but I wouldn't think it would
happen when enabling *ACS*.
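
For example (untested sketch; the helper name is made up, and it assumes
the Root Port advertises AER so dev->aer_cap is non-zero), something like
this right before the PCI_ACS_CTRL write would show it:

static void dump_port_aer_status(struct pci_dev *dev)
{
	struct pci_dev *rp = pcie_find_root_port(dev);
	u32 uncor, cor;

	if (!rp || !rp->aer_cap)
		return;

	pci_read_config_dword(rp, rp->aer_cap + PCI_ERR_UNCOR_STATUS, &uncor);
	pci_read_config_dword(rp, rp->aer_cap + PCI_ERR_COR_STATUS, &cor);
	pci_info(rp, "AER status before ACS enable: uncor=%#010x cor=%#010x\n",
		 uncor, cor);
}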

Does this error happen on multiple machines from different vendors?
Wondering if it could be a BIOS issue, e.g., BIOS not cleaning up
after it did something to cause an error.

> > If we *do* take the error, why doesn't DPC recovery work?
> 
> It works for the root port, but not for the NVMe drive:
> [   50.947816] pcieport 0000:00:1b.0: DPC: containment event,
> status:0x1f01 source:0x0000
> [   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> [   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
> ID)
> [   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error
> status/mask=00200000/00010000
> [   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
> [   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> [   50.947843] nvme nvme0: frozen state error detected, reset controller
> [   50.948400] ACPI: EC: event unblocked
> [   50.948432] xhci_hcd 0000:00:14.0: PME# disabled
> [   50.948444] xhci_hcd 0000:00:14.0: enabling bus mastering
> [   50.949056] pcieport 0000:00:1b.0: PME# disabled
> [   50.949068] pcieport 0000:00:1c.0: PME# disabled
> [   50.949416] e1000e 0000:00:1f.6: PME# disabled
> [   50.949463] e1000e 0000:00:1f.6: enabling bus mastering
> [   50.951606] sd 0:0:0:0: [sda] Starting disk
> [   50.951610] nvme 0000:01:00.0: can't change power state from D3hot
> to D0 (config space inaccessible)
> [   50.951730] nvme nvme0: Removing after probe failure status: -19
> [   50.952360] nvme nvme0: failed to set APST feature (-19)
> [   50.971136] snd_hda_intel 0000:00:1f.3: PME# disabled
> [   51.089330] pcieport 0000:00:1b.0: AER: broadcast resume message
> [   51.089345] pcieport 0000:00:1b.0: AER: device recovery successful
> 
> But I think why recovery doesn't work for NVMe is for another discussion...
> 
> Kai-Heng
> 
> >
> > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
> > > Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
> > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > ---
> > >  drivers/pci/pcie/aer.c | 18 ++++++++++++++++++
> > >  1 file changed, 18 insertions(+)
> > >
> > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > > index 77b0f2c45bc0..0e9a85530ae6 100644
> > > --- a/drivers/pci/pcie/aer.c
> > > +++ b/drivers/pci/pcie/aer.c
> > > @@ -1365,6 +1365,22 @@ static int aer_probe(struct pcie_device *dev)
> > >       return 0;
> > >  }
> > >
> > > +static int aer_suspend(struct pcie_device *dev)
> > > +{
> > > +     struct aer_rpc *rpc = get_service_data(dev);
> > > +
> > > +     aer_disable_rootport(rpc);
> > > +     return 0;
> > > +}
> > > +
> > > +static int aer_resume(struct pcie_device *dev)
> > > +{
> > > +     struct aer_rpc *rpc = get_service_data(dev);
> > > +
> > > +     aer_enable_rootport(rpc);
> > > +     return 0;
> > > +}
> > > +
> > >  /**
> > >   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
> > >   * @dev: pointer to Root Port, RCEC, or RCiEP
> > > @@ -1437,6 +1453,8 @@ static struct pcie_port_service_driver aerdriver = {
> > >       .service        = PCIE_PORT_SERVICE_AER,
> > >
> > >       .probe          = aer_probe,
> > > +     .suspend        = aer_suspend,
> > > +     .resume         = aer_resume,
> > >       .remove         = aer_remove,
> > >  };
> > >
> > > --
> > > 2.29.2
> > >
Kai-Heng Feng Feb. 5, 2021, 3:17 p.m. UTC | #4
On Fri, Feb 5, 2021 at 7:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Alex]
>
> On Thu, Jan 28, 2021 at 12:09:37PM +0800, Kai-Heng Feng wrote:
> > On Thu, Jan 28, 2021 at 4:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Jan 28, 2021 at 01:31:00AM +0800, Kai-Heng Feng wrote:
> > > > Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
> > > > hint") enables ACS, and some platforms lose its NVMe after resume from
> > > > firmware:
> > > > [   50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
> > > > [   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> > > > [   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> > > > [   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error status/mask=00200000/00010000
> > > > [   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
> > > > [   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> > > > [   50.947843] nvme nvme0: frozen state error detected, reset controller
> > > >
> > > > It happens right after ACS gets enabled during resume.
> > > >
> > > > To prevent that from happening, disable AER interrupt and enable it on
> > > > system suspend and resume, respectively.
> > >
> > > Lots of questions here.  Maybe this is what we'll end up doing, but I
> > > am curious about why the error is reported in the first place.
> > >
> > > Is this a consequence of the link going down and back up?
> >
> > Could be. From the observations, it only happens when firmware suspend
> > (S3) is used.
> > Maybe it happens when it's gets powered up, but I don't have equipment
> > to debug at hardware level.
> >
> > If we use non-firmware suspend method, enabling ACS after resume won't
> > trip AER and DPC.
> >
> > > Is it consequence of the device doing a DMA when it shouldn't?
> >
> > If it's doing DMA while suspending, the same error should also happen
> > after NVMe is suspended and before PCIe port suspending.
> > Furthermore, if non-firmware suspend method is used, there's so such
> > issue, so less likely to be any DMA operation.
> >
> > > Are we doing something in the wrong order during suspend?  Or maybe
> > > resume, since I assume the error is reported during resume?
> >
> > Yes the error is reported during resume. The suspend/resume order
> > seems fine as non-firmware suspend doesn't have this issue.
>
> I really feel like we need a better understanding of what's going on
> here.  Disabling the AER interrupt is like closing our eyes and
> pretending that because we don't see it, it didn't happen.
>
> An ACS error is triggered by a DMA, right?  I'm assuming an MMIO
> access from the CPU wouldn't trigger this error.  And it sounds like
> the error is triggered before we even start running the driver after
> resume.
>
> If we're powering up an NVMe device from D3cold and it DMAs before the
> driver touches it, something would be seriously broken.  I doubt
> that's what's happening.  Maybe a device could resume some previously
> programmed DMA after powering up from D3hot.

I am not that familiar with PCIe ACS/AER/DPC, so I can't really answer
the questions you raised.
The fact that the PCIe spec doesn't specify a suspend/resume order is
also not helping here.

However, I really think it's a system firmware issue.
I've seen some suspend-to-idle platforms where the NVMe device can reach
D3cold, and those are unaffected.

>
> Or maybe the error occurred on suspend, like if the device wasn't
> quiesced or something, but we didn't notice it until resume?  The
> AER error status bits are RW1CS, which means they can be preserved
> across hot/warm/cold resets.
>
> Can you instrument the code to see whether the AER error status bit is
> set before enabling ACS?  I'm not sure that merely enabling ACS (I
> assume you mean pci_std_enable_acs(), where we write PCI_ACS_CTRL)
> should cause an interrupt for a previously-logged error.  I suspect
> that could happen when enabling *AER*, but I wouldn't think it would
> happen when enabling *ACS*.

Diff to print AER status:
https://bugzilla.kernel.org/show_bug.cgi?id=209149#c11

And dmesg:
https://bugzilla.kernel.org/show_bug.cgi?id=209149#c12

Looks like the reads before suspend and after resume are both fine.

>
> Does this error happen on multiple machines from different vendors?
> Wondering if it could be a BIOS issue, e.g., BIOS not cleaning up
> after it did something to cause an error.

AFAIK, systems from both HP and Dell are affected.
I was told that the reference platform from Intel uses suspend-to-idle,
but vendors changed the sleep method to S3 to get lower power
consumption and meet regulatory requirements.

Kai-Heng

>
> > > If we *do* take the error, why doesn't DPC recovery work?
> >
> > It works for the root port, but not for the NVMe drive:
> > [   50.947816] pcieport 0000:00:1b.0: DPC: containment event,
> > status:0x1f01 source:0x0000
> > [   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> > [   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error:
> > severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
> > ID)
> > [   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error
> > status/mask=00200000/00010000
> > [   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
> > [   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> > [   50.947843] nvme nvme0: frozen state error detected, reset controller
> > [   50.948400] ACPI: EC: event unblocked
> > [   50.948432] xhci_hcd 0000:00:14.0: PME# disabled
> > [   50.948444] xhci_hcd 0000:00:14.0: enabling bus mastering
> > [   50.949056] pcieport 0000:00:1b.0: PME# disabled
> > [   50.949068] pcieport 0000:00:1c.0: PME# disabled
> > [   50.949416] e1000e 0000:00:1f.6: PME# disabled
> > [   50.949463] e1000e 0000:00:1f.6: enabling bus mastering
> > [   50.951606] sd 0:0:0:0: [sda] Starting disk
> > [   50.951610] nvme 0000:01:00.0: can't change power state from D3hot
> > to D0 (config space inaccessible)
> > [   50.951730] nvme nvme0: Removing after probe failure status: -19
> > [   50.952360] nvme nvme0: failed to set APST feature (-19)
> > [   50.971136] snd_hda_intel 0000:00:1f.3: PME# disabled
> > [   51.089330] pcieport 0000:00:1b.0: AER: broadcast resume message
> > [   51.089345] pcieport 0000:00:1b.0: AER: device recovery successful
> >
> > But I think why recovery doesn't work for NVMe is for another discussion...
> >
> > Kai-Heng
> >
> > >
> > > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
> > > > Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
> > > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > ---
> > > >  drivers/pci/pcie/aer.c | 18 ++++++++++++++++++
> > > >  1 file changed, 18 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > > > index 77b0f2c45bc0..0e9a85530ae6 100644
> > > > --- a/drivers/pci/pcie/aer.c
> > > > +++ b/drivers/pci/pcie/aer.c
> > > > @@ -1365,6 +1365,22 @@ static int aer_probe(struct pcie_device *dev)
> > > >       return 0;
> > > >  }
> > > >
> > > > +static int aer_suspend(struct pcie_device *dev)
> > > > +{
> > > > +     struct aer_rpc *rpc = get_service_data(dev);
> > > > +
> > > > +     aer_disable_rootport(rpc);
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static int aer_resume(struct pcie_device *dev)
> > > > +{
> > > > +     struct aer_rpc *rpc = get_service_data(dev);
> > > > +
> > > > +     aer_enable_rootport(rpc);
> > > > +     return 0;
> > > > +}
> > > > +
> > > >  /**
> > > >   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
> > > >   * @dev: pointer to Root Port, RCEC, or RCiEP
> > > > @@ -1437,6 +1453,8 @@ static struct pcie_port_service_driver aerdriver = {
> > > >       .service        = PCIE_PORT_SERVICE_AER,
> > > >
> > > >       .probe          = aer_probe,
> > > > +     .suspend        = aer_suspend,
> > > > +     .resume         = aer_resume,
> > > >       .remove         = aer_remove,
> > > >  };
> > > >
> > > > --
> > > > 2.29.2
> > > >
Bjorn Helgaas July 22, 2021, 10:23 p.m. UTC | #5
On Fri, Feb 05, 2021 at 11:17:32PM +0800, Kai-Heng Feng wrote:
> On Fri, Feb 5, 2021 at 7:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > [+cc Alex]
> >
> > On Thu, Jan 28, 2021 at 12:09:37PM +0800, Kai-Heng Feng wrote:
> > > On Thu, Jan 28, 2021 at 4:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, Jan 28, 2021 at 01:31:00AM +0800, Kai-Heng Feng wrote:
> > > > > Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
> > > > > hint") enables ACS, and some platforms lose its NVMe after resume from
> > > > > firmware:
> > > > > [   50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
> > > > > [   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> > > > > [   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> > > > > [   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error status/mask=00200000/00010000
> > > > > [   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
> > > > > [   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> > > > > [   50.947843] nvme nvme0: frozen state error detected, reset controller
> > > > >
> > > > > It happens right after ACS gets enabled during resume.
> > > > >
> > > > > To prevent that from happening, disable AER interrupt and enable it on
> > > > > system suspend and resume, respectively.
> > > >
> > > > Lots of questions here.  Maybe this is what we'll end up doing, but I
> > > > am curious about why the error is reported in the first place.
> > > >
> > > > Is this a consequence of the link going down and back up?
> > >
> > > Could be. From the observations, it only happens when firmware suspend
> > > (S3) is used.
> > > Maybe it happens when it's gets powered up, but I don't have equipment
> > > to debug at hardware level.
> > >
> > > If we use non-firmware suspend method, enabling ACS after resume won't
> > > trip AER and DPC.
> > >
> > > > Is it consequence of the device doing a DMA when it shouldn't?
> > >
> > > If it's doing DMA while suspending, the same error should also happen
> > > after NVMe is suspended and before PCIe port suspending.
> > > Furthermore, if non-firmware suspend method is used, there's so such
> > > issue, so less likely to be any DMA operation.
> > >
> > > > Are we doing something in the wrong order during suspend?  Or maybe
> > > > resume, since I assume the error is reported during resume?
> > >
> > > Yes the error is reported during resume. The suspend/resume order
> > > seems fine as non-firmware suspend doesn't have this issue.
> >
> > I really feel like we need a better understanding of what's going on
> > here.  Disabling the AER interrupt is like closing our eyes and
> > pretending that because we don't see it, it didn't happen.
> >
> > An ACS error is triggered by a DMA, right?  I'm assuming an MMIO
> > access from the CPU wouldn't trigger this error.  And it sounds like
> > the error is triggered before we even start running the driver after
> > resume.
> >
> > If we're powering up an NVMe device from D3cold and it DMAs before the
> > driver touches it, something would be seriously broken.  I doubt
> > that's what's happening.  Maybe a device could resume some previously
> > programmed DMA after powering up from D3hot.
> 
> I am not that familiar with PCIe ACS/AER/DPC, so I can't really answer
> questions you raised.
> PCIe spec doesn't say the suspend/resume order is also not helping here.
> 
> However, I really think it's a system firmware issue.
> I've seen some suspend-to-idle platforms with NVMe can reach D3cold,
> those are unaffected.

Marking both of these as "not applicable" for now because I don't
think we really understand what's going on.

Apparently a DMA occurs during suspend or resume and triggers an ACS
violation.  I don't think such a DMA should occur in the first
place.

Or maybe, since you say the problem happens right after ACS is enabled
during resume, we're doing the ACS enable incorrectly?  Although I
would think we should not be doing DMA at the same time we're enabling
ACS, either.
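
For reference, the ACS enable itself is just a config-space
read-modify-write, roughly like this (simplified sketch of what
pci_std_enable_acs() does; the wrapper name is only for illustration,
not the exact upstream code):

static void enable_acs_sketch(struct pci_dev *dev)
{
	u16 cap, ctrl;
	int pos;

	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
	if (!pos)
		return;

	pci_read_config_word(dev, pos + PCI_ACS_CAP, &cap);
	pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);

	/* Source Validation, P2P Request/Completion Redirect, Upstream Forwarding */
	ctrl |= cap & (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);

	pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
}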

If this really is a system firmware issue, both HP and Dell should
have the knowledge and equipment to figure out what's going on.

> > Or maybe the error occurred on suspend, like if the device wasn't
> > quiesced or something, but we didn't notice it until resume?  The
> > AER error status bits are RW1CS, which means they can be preserved
> > across hot/warm/cold resets.
> >
> > Can you instrument the code to see whether the AER error status bit is
> > set before enabling ACS?  I'm not sure that merely enabling ACS (I
> > assume you mean pci_std_enable_acs(), where we write PCI_ACS_CTRL)
> > should cause an interrupt for a previously-logged error.  I suspect
> > that could happen when enabling *AER*, but I wouldn't think it would
> > happen when enabling *ACS*.
> 
> Diff to print AER status:
> https://bugzilla.kernel.org/show_bug.cgi?id=209149#c11
> 
> And dmesg:
> https://bugzilla.kernel.org/show_bug.cgi?id=209149#c12
> 
> Looks like the read before suspend and after resume are both fine.
> 
> >
> > Does this error happen on multiple machines from different vendors?
> > Wondering if it could be a BIOS issue, e.g., BIOS not cleaning up
> > after it did something to cause an error.
> 
> AFAIK, systems from both HP and Dell are affected.
> I was told that the reference platform from Intel is using
> suspend-to-idle, but vendors changed the sleep method to S3 to have
> lower power consumption to pass regulation.
> 
> Kai-Heng
> 
> >
> > > > If we *do* take the error, why doesn't DPC recovery work?
> > >
> > > It works for the root port, but not for the NVMe drive:
> > > [   50.947816] pcieport 0000:00:1b.0: DPC: containment event,
> > > status:0x1f01 source:0x0000
> > > [   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> > > [   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error:
> > > severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
> > > ID)
> > > [   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error
> > > status/mask=00200000/00010000
> > > [   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
> > > [   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> > > [   50.947843] nvme nvme0: frozen state error detected, reset controller
> > > [   50.948400] ACPI: EC: event unblocked
> > > [   50.948432] xhci_hcd 0000:00:14.0: PME# disabled
> > > [   50.948444] xhci_hcd 0000:00:14.0: enabling bus mastering
> > > [   50.949056] pcieport 0000:00:1b.0: PME# disabled
> > > [   50.949068] pcieport 0000:00:1c.0: PME# disabled
> > > [   50.949416] e1000e 0000:00:1f.6: PME# disabled
> > > [   50.949463] e1000e 0000:00:1f.6: enabling bus mastering
> > > [   50.951606] sd 0:0:0:0: [sda] Starting disk
> > > [   50.951610] nvme 0000:01:00.0: can't change power state from D3hot
> > > to D0 (config space inaccessible)
> > > [   50.951730] nvme nvme0: Removing after probe failure status: -19
> > > [   50.952360] nvme nvme0: failed to set APST feature (-19)
> > > [   50.971136] snd_hda_intel 0000:00:1f.3: PME# disabled
> > > [   51.089330] pcieport 0000:00:1b.0: AER: broadcast resume message
> > > [   51.089345] pcieport 0000:00:1b.0: AER: device recovery successful
> > >
> > > But I think why recovery doesn't work for NVMe is for another discussion...
> > >
> > > Kai-Heng
> > >
> > > >
> > > > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
> > > > > Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
> > > > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > > ---
> > > > >  drivers/pci/pcie/aer.c | 18 ++++++++++++++++++
> > > > >  1 file changed, 18 insertions(+)
> > > > >
> > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > > > > index 77b0f2c45bc0..0e9a85530ae6 100644
> > > > > --- a/drivers/pci/pcie/aer.c
> > > > > +++ b/drivers/pci/pcie/aer.c
> > > > > @@ -1365,6 +1365,22 @@ static int aer_probe(struct pcie_device *dev)
> > > > >       return 0;
> > > > >  }
> > > > >
> > > > > +static int aer_suspend(struct pcie_device *dev)
> > > > > +{
> > > > > +     struct aer_rpc *rpc = get_service_data(dev);
> > > > > +
> > > > > +     aer_disable_rootport(rpc);
> > > > > +     return 0;
> > > > > +}
> > > > > +
> > > > > +static int aer_resume(struct pcie_device *dev)
> > > > > +{
> > > > > +     struct aer_rpc *rpc = get_service_data(dev);
> > > > > +
> > > > > +     aer_enable_rootport(rpc);
> > > > > +     return 0;
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
> > > > >   * @dev: pointer to Root Port, RCEC, or RCiEP
> > > > > @@ -1437,6 +1453,8 @@ static struct pcie_port_service_driver aerdriver = {
> > > > >       .service        = PCIE_PORT_SERVICE_AER,
> > > > >
> > > > >       .probe          = aer_probe,
> > > > > +     .suspend        = aer_suspend,
> > > > > +     .resume         = aer_resume,
> > > > >       .remove         = aer_remove,
> > > > >  };
> > > > >
> > > > > --
> > > > > 2.29.2
> > > > >
Christoph Hellwig July 23, 2021, 5:24 a.m. UTC | #6
On Thu, Jul 22, 2021 at 05:23:51PM -0500, Bjorn Helgaas wrote:
> Marking both of these as "not applicable" for now because I don't
> think we really understand what's going on.
> 
> Apparently a DMA occurs during suspend or resume and triggers an ACS
> violation.  I don't think think such a DMA should occur in the first
> place.
> 
> Or maybe, since you say the problem happens right after ACS is enabled
> during resume, we're doing the ACS enable incorrectly?  Although I
> would think we should not be doing DMA at the same time we're enabling
> ACS, either.
> 
> If this really is a system firmware issue, both HP and Dell should
> have the knowledge and equipment to figure out what's going on.

DMA on resume sounds really odd.  OTOH the below mentioned case of
a DMA during suspend seems very likely in some setups.  NVMe has the
concept of a host memory buffer (HMB) that allows the PCIe device
to use arbitrary host memory for internal purposes.  Combine this
with the "Storage D3" misfeature in modern x86 platforms that forces
a slot into D3cold without consulting the driver first and you'd see
symptoms like this.  Another case would be the NVMe equivalent of the
AER, which could lead to a completion without host activity.

We now have quirks in the ACPI layer and NVMe to fully shut down the
NVMe controllers on these messed up systems with the "Storage D3"
misfeature, which should avoid such "spurious" DMAs at the cost of
wearing out the device much faster.
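
Roughly, the wiring looks like this in nvme_probe() (simplified sketch
from memory, not the exact upstream code; it relies on acpi_storage_d3()
and NVME_QUIRK_SIMPLE_SUSPEND, which exist in current trees):

	if (acpi_storage_d3(&pdev->dev)) {
		/*
		 * Platform firmware forces the slot into D3cold over
		 * suspend, so do a full controller shutdown instead of
		 * relying on NVMe power states (and a live HMB).
		 */
		quirks |= NVME_QUIRK_SIMPLE_SUSPEND;
		dev_info(&pdev->dev, "platform quirk: setting simple suspend\n");
	}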
Kai-Heng Feng July 23, 2021, 7:05 a.m. UTC | #7
On Fri, Jul 23, 2021 at 1:24 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Jul 22, 2021 at 05:23:51PM -0500, Bjorn Helgaas wrote:
> > Marking both of these as "not applicable" for now because I don't
> > think we really understand what's going on.
> >
> > Apparently a DMA occurs during suspend or resume and triggers an ACS
> > violation.  I don't think think such a DMA should occur in the first
> > place.
> >
> > Or maybe, since you say the problem happens right after ACS is enabled
> > during resume, we're doing the ACS enable incorrectly?  Although I
> > would think we should not be doing DMA at the same time we're enabling
> > ACS, either.
> >
> > If this really is a system firmware issue, both HP and Dell should
> > have the knowledge and equipment to figure out what's going on.
>
> DMA on resume sounds really odd.  OTOH the below mentioned case of
> a DMA during suspend seems very like in some setup.  NVMe has the
> concept of a host memory buffer (HMB) that allows the PCIe device
> to use arbitrary host memory for internal purposes.  Combine this
> with the "Storage D3" misfeature in modern x86 platforms that force
> a slot into d3cold without consulting the driver first and you'd see
> symptoms like this.  Another case would be the NVMe equivalent of the
> AER which could lead to a completion without host activity.

The issue can also be observed on non-HMB NVMe.

>
> We now have quirks in the ACPI layer and NVMe to fully shut down the
> NVMe controllers on these messed up systems with the "Storage D3"
> misfeature which should avoid such "spurious" DMAs at the cost of
> wearning out the device much faster.

Since the issue is with S3, I think the NVMe controller is always fully
shut down anyway.

Kai-Heng

Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 77b0f2c45bc0..0e9a85530ae6 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1365,6 +1365,22 @@  static int aer_probe(struct pcie_device *dev)
 	return 0;
 }
 
+static int aer_suspend(struct pcie_device *dev)
+{
+	struct aer_rpc *rpc = get_service_data(dev);
+
+	aer_disable_rootport(rpc);
+	return 0;
+}
+
+static int aer_resume(struct pcie_device *dev)
+{
+	struct aer_rpc *rpc = get_service_data(dev);
+
+	aer_enable_rootport(rpc);
+	return 0;
+}
+
 /**
  * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
  * @dev: pointer to Root Port, RCEC, or RCiEP
@@ -1437,6 +1453,8 @@  static struct pcie_port_service_driver aerdriver = {
 	.service	= PCIE_PORT_SERVICE_AER,
 
 	.probe		= aer_probe,
+	.suspend	= aer_suspend,
+	.resume		= aer_resume,
 	.remove		= aer_remove,
 };