Message ID | 20220105060643.822111-1-kai.heng.feng@canonical.com |
---|---|
State | Changes Requested |
Series | PCI/portdrv: Skip enabling AER on external facing ports |
On Wed, Jan 05, 2022 at 02:06:41PM +0800, Kai-Heng Feng wrote:
> The Thunderbolt root ports may constantly spew out uncorrected errors
> from AER service:
> [ 30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> [ 30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [ 30.100256] pcieport 0000:00:1d.0: device [8086:7ab0] error status/mask=00100000/00004000
> [ 30.100262] pcieport 0000:00:1d.0: [20] UnsupReq (First)
> [ 30.100267] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 08000052 00000000 00000000
> [ 30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
> [ 30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
> [ 30.100427] pcieport 0000:00:1d.0: AER: device recovery failed

No timestamps needed here; they don't add to understanding the
problem.

> The link may not be reliable on external facing ports, so don't enable
> AER on those ports.

I'm not sure what you want to accomplish here. If the errors are
legitimate and the result of some hardware issue like a bad cable, why
should we ignore them? If they're caused by a software problem, we
should figure that out and fix it.

Does this occur on a specific instance of possibly flaky hardware?

You mention a spew of errors; do you think this is a single error that
we fail to clear correctly? Or is it really many separate errors?

> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/pci/pcie/portdrv_core.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index bda630889f955..d464d00ade8f2 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -219,7 +219,8 @@ static int get_port_device_capability(struct pci_dev *dev)
>
>  #ifdef CONFIG_PCIEAER
>      if (dev->aer_cap && pci_aer_available() &&
> -        (pcie_ports_native || host->native_aer)) {
> +        (pcie_ports_native || host->native_aer) &&
> +        !dev->external_facing) {
>          services |= PCIE_PORT_SERVICE_AER;
>
>          /*
> --
> 2.33.1
>
On Thu, Jan 6, 2022 at 4:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Jan 05, 2022 at 02:06:41PM +0800, Kai-Heng Feng wrote:
> > The Thunderbolt root ports may constantly spew out uncorrected errors
> > from AER service:
> > [ 30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> > [ 30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > [ 30.100256] pcieport 0000:00:1d.0: device [8086:7ab0] error status/mask=00100000/00004000
> > [ 30.100262] pcieport 0000:00:1d.0: [20] UnsupReq (First)
> > [ 30.100267] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 08000052 00000000 00000000
> > [ 30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
> > [ 30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
> > [ 30.100427] pcieport 0000:00:1d.0: AER: device recovery failed
>
> No timestamps needed here; they don't add to understanding the
> problem.

Got it. Will remove it for later iteration.

>
> > The link may not be reliable on external facing ports, so don't enable
> > AER on those ports.
>
> I'm not sure what you want to accomplish here. If the errors are
> legitimate and the result of some hardware issue like a bad cable, why
> should we ignore them? If they're caused by a software problem, we
> should figure that out and fix it.
>
> Does this occur on a specific instance of possibly flaky hardware?

Only from root ports of thunderbolt devices.

The error occurs as soon as the root port is runtime suspended to D3cold.

Runtime suspend the AER service can resolve the issue. I wonder if
it's the right thing to do here?

D3cold should also mean the PCI link is gone, disabling AER seems to
be a reasonable approach.

Kai-Heng

>
> You mention a spew of errors; do you think this is a single error that
> we fail to clear correctly? Or is it really many separate errors?
>
> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > ---
> >  drivers/pci/pcie/portdrv_core.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> > index bda630889f955..d464d00ade8f2 100644
> > --- a/drivers/pci/pcie/portdrv_core.c
> > +++ b/drivers/pci/pcie/portdrv_core.c
> > @@ -219,7 +219,8 @@ static int get_port_device_capability(struct pci_dev *dev)
> >
> >  #ifdef CONFIG_PCIEAER
> >      if (dev->aer_cap && pci_aer_available() &&
> > -        (pcie_ports_native || host->native_aer)) {
> > +        (pcie_ports_native || host->native_aer) &&
> > +        !dev->external_facing) {
> >          services |= PCIE_PORT_SERVICE_AER;
> >
> >          /*
> > --
> > 2.33.1
> >
Hi Kai-Heng,

On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> Only from root ports of thunderbolt devices.
>
> The error occurs as soon as the root port is runtime suspended to D3cold.
>
> Runtime suspend the AER service can resolve the issue. I wonder if
> it's the right thing to do here?

I think you are right here. It seems that AER "service driver" is
completely missing PM hooks. Probably because it is more used in server
type of systems where power management is not priority.

> D3cold should also mean the PCI link is gone, disabling AER seems to
> be a reasonable approach.

Indeed - I think AER might trigger here because the link goes "down" /
low power state if left enabled while the root port enters D3. Something
like below hack should disable it over low power transitions:

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..64138cf82db8 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1432,6 +1432,22 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
     return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
 }
 
+static int aer_suspend(struct pcie_device *dev)
+{
+    struct aer_rpc *rpc = get_service_data(dev);
+
+    aer_disable_rootport(rpc);
+    return 0;
+}
+
+static int aer_resume(struct pcie_device *dev)
+{
+    struct aer_rpc *rpc = get_service_data(dev);
+
+    aer_enable_rootport(rpc);
+    return 0;
+}
+
 static struct pcie_port_service_driver aerdriver = {
     .name = "aer",
     .port_type = PCIE_ANY_PORT,
@@ -1439,6 +1455,10 @@ static struct pcie_port_service_driver aerdriver = {
 
     .probe = aer_probe,
     .remove = aer_remove,
+    .suspend = aer_suspend,
+    .resume = aer_resume,
+    .runtime_suspend = aer_suspend,
+    .runtime_resume = aer_resume,
 };
 
 /**
Hi Mika,

On Fri, Jan 21, 2022 at 6:55 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> Hi Kai-Heng,
>
> On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> > Only from root ports of thunderbolt devices.
> >
> > The error occurs as soon as the root port is runtime suspended to D3cold.
> >
> > Runtime suspend the AER service can resolve the issue. I wonder if
> > it's the right thing to do here?
>
> I think you are right here. It seems that AER "service driver" is
> completely missing PM hooks. Probably because it is more used in server
> type of systems where power management is not priority.

Here is my previous attempt to suspend AER:
https://lore.kernel.org/linux-pci/20210127173101.446940-1-kai.heng.feng@canonical.com/

>
> > D3cold should also mean the PCI link is gone, disabling AER seems to
> > be a reasonable approach.
>
> Indeed - I think AER might trigger here because the link goes "down" /
> low power state if left enabled while the root port enters D3. Something
> like below hack should disable it over low power transitions:

Ubuntu kernel has been carrying the patch for quite some time:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/unstable/commit/?id=e82f15f1a26273b004054a81ef45937fb1b632e5

>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b27..64138cf82db8 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1432,6 +1432,22 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>      return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>  }
>
> +static int aer_suspend(struct pcie_device *dev)
> +{
> +    struct aer_rpc *rpc = get_service_data(dev);
> +
> +    aer_disable_rootport(rpc);
> +    return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> +    struct aer_rpc *rpc = get_service_data(dev);
> +
> +    aer_enable_rootport(rpc);
> +    return 0;
> +}
> +
>  static struct pcie_port_service_driver aerdriver = {
>      .name = "aer",
>      .port_type = PCIE_ANY_PORT,
> @@ -1439,6 +1455,10 @@ static struct pcie_port_service_driver aerdriver = {
>
>      .probe = aer_probe,
>      .remove = aer_remove,
> +    .suspend = aer_suspend,
> +    .resume = aer_resume,
> +    .runtime_suspend = aer_suspend,
> +    .runtime_resume = aer_resume,
>  };

This patch is exactly what I tested.

Maybe only suspend/runtime_suspend AER when the target PM state is
D3cold? The PCIe spec doesn't say how to handle AER in Link
L2/L3Ready/L3, but I think it's reasonable to suspend AER when power is
lost.

Let me come up with a patch with that idea.

Kai-Heng

>
> /**
Hi,

On Fri, Jan 21, 2022 at 08:31:27PM +0800, Kai-Heng Feng wrote:
> Hi Mika,
>
> On Fri, Jan 21, 2022 at 6:55 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > Hi Kai-Heng,
> >
> > On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> > > Only from root ports of thunderbolt devices.
> > >
> > > The error occurs as soon as the root port is runtime suspended to D3cold.
> > >
> > > Runtime suspend the AER service can resolve the issue. I wonder if
> > > it's the right thing to do here?
> >
> > I think you are right here. It seems that AER "service driver" is
> > completely missing PM hooks. Probably because it is more used in server
> > type of systems where power management is not priority.
>
> Here is my previous attempt to suspend AER:
> https://lore.kernel.org/linux-pci/20210127173101.446940-1-kai.heng.feng@canonical.com/

That's great!

I think we should do the same for runtime PM paths too, though. Will you
take care of that as well? :)
On Fri, Jan 21, 2022 at 8:44 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> Hi,
>
> On Fri, Jan 21, 2022 at 08:31:27PM +0800, Kai-Heng Feng wrote:
> > Hi Mika,
> >
> > On Fri, Jan 21, 2022 at 6:55 PM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > >
> > > Hi Kai-Heng,
> > >
> > > On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> > > > Only from root ports of thunderbolt devices.
> > > >
> > > > The error occurs as soon as the root port is runtime suspended to D3cold.
> > > >
> > > > Runtime suspend the AER service can resolve the issue. I wonder if
> > > > it's the right thing to do here?
> > >
> > > I think you are right here. It seems that AER "service driver" is
> > > completely missing PM hooks. Probably because it is more used in server
> > > type of systems where power management is not priority.
> >
> > Here is my previous attempt to suspend AER:
> > https://lore.kernel.org/linux-pci/20210127173101.446940-1-kai.heng.feng@canonical.com/
>
> That's great!
>
> I think we should do the same for runtime PM paths too, though. Will you
> take care of that as well? :)

Yes that's the plan. I hope I can persuade Bjorn this time...

Kai-Heng
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index bda630889f955..d464d00ade8f2 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -219,7 +219,8 @@ static int get_port_device_capability(struct pci_dev *dev)
 
 #ifdef CONFIG_PCIEAER
     if (dev->aer_cap && pci_aer_available() &&
-        (pcie_ports_native || host->native_aer)) {
+        (pcie_ports_native || host->native_aer) &&
+        !dev->external_facing) {
         services |= PCIE_PORT_SERVICE_AER;
 
         /*
The Thunderbolt root ports may constantly spew out uncorrected errors
from AER service:
[ 30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
[ 30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 30.100256] pcieport 0000:00:1d.0: device [8086:7ab0] error status/mask=00100000/00004000
[ 30.100262] pcieport 0000:00:1d.0: [20] UnsupReq (First)
[ 30.100267] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 08000052 00000000 00000000
[ 30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
[ 30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
[ 30.100427] pcieport 0000:00:1d.0: AER: device recovery failed

The link may not be reliable on external facing ports, so don't enable
AER on those ports.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/pci/pcie/portdrv_core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)