Message ID | 20230714050541.2765246-1-kai.heng.feng@canonical.com (mailing list archive) |
---|---|
State | Handled Elsewhere |
Headers | show |
Series | [v2] igc: Ignore AER reset when device is suspended | expand |
On Fri, Jul 14, 2023 at 01:05:41PM +0800, Kai-Heng Feng wrote: > When a system that connects to a Thunderbolt dock equipped with I225, > like HP Thunderbolt Dock G4, I225 stops working after S3 resume: > ... > The issue is that the PTM requests are sending before driver resumes the > device. Since the issue can also be observed on Windows, it's quite > likely a firmware/hardware limitation. Does this mean we didn't disable PTM correctly on suspend? Or is the device defective and sending PTM requests even though PTM is disabled? If the latter, I vote for a quirk that just disables PTM completely for this device. This check in .error_detected() looks out of place to me because there's no connection between AER and PTM, there's no connection between PTM and the device being enabled, and the connection between the device being enabled and being fully resumed is a little tenuous. If we must do it this way, maybe add a comment about *why* we're checking pci_is_enabled(). Otherwise this will be copied to other drivers that don't need it. > So avoid resetting the device if it's not resumed. Once the device is > fully resumed, the device can work normally. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=216850 > Reviewed-by: Guilherme G. Piccoli <gpiccoli@igalia.com> > Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> > > --- > v2: > - Fix typo. > - Mention the product name. > > drivers/net/ethernet/intel/igc/igc_main.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c > index 9f93f0f4f752..8c36bbe5e428 100644 > --- a/drivers/net/ethernet/intel/igc/igc_main.c > +++ b/drivers/net/ethernet/intel/igc/igc_main.c > @@ -7115,6 +7115,9 @@ static pci_ers_result_t igc_io_error_detected(struct pci_dev *pdev, > struct net_device *netdev = pci_get_drvdata(pdev); > struct igc_adapter *adapter = netdev_priv(netdev); > > + if (!pci_is_enabled(pdev)) > + return 0; > + > netif_device_detach(netdev); > > if (state == pci_channel_io_perm_failure) > -- > 2.34.1 >
Bjorn Helgaas <helgaas@kernel.org> writes: > On Fri, Jul 14, 2023 at 01:05:41PM +0800, Kai-Heng Feng wrote: >> When a system that connects to a Thunderbolt dock equipped with I225, >> like HP Thunderbolt Dock G4, I225 stops working after S3 resume: >> ... > >> The issue is that the PTM requests are sending before driver resumes the >> device. Since the issue can also be observed on Windows, it's quite >> likely a firmware/hardware limitation. > > Does this mean we didn't disable PTM correctly on suspend? Or is the > device defective and sending PTM requests even though PTM is disabled? > The way I understand the hardware bug, the device is defective, as you said, the device sends PTM messages when "busmastering" is disabled. > If the latter, I vote for a quirk that just disables PTM completely > for this device. > My suggestion is that adding this quirk would be a last resort kind of thing. There are users/customers that depend on the increased time synchronization accuracy that PTM provides. > This check in .error_detected() looks out of place to me because > there's no connection between AER and PTM, there's no connection > between PTM and the device being enabled, and the connection between > the device being enabled and being fully resumed is a little tenuous. > > If we must do it this way, maybe add a comment about *why* we're > checking pci_is_enabled(). Otherwise this will be copied to other > drivers that don't need it. Makes total sense, from my side. > >> So avoid resetting the device if it's not resumed. Once the device is >> fully resumed, the device can work normally. >> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216850 >> Reviewed-by: Guilherme G. Piccoli <gpiccoli@igalia.com> >> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> >> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> >> >> --- >> v2: >> - Fix typo. >> - Mention the product name. >> >> drivers/net/ethernet/intel/igc/igc_main.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c >> index 9f93f0f4f752..8c36bbe5e428 100644 >> --- a/drivers/net/ethernet/intel/igc/igc_main.c >> +++ b/drivers/net/ethernet/intel/igc/igc_main.c >> @@ -7115,6 +7115,9 @@ static pci_ers_result_t igc_io_error_detected(struct pci_dev *pdev, >> struct net_device *netdev = pci_get_drvdata(pdev); >> struct igc_adapter *adapter = netdev_priv(netdev); >> >> + if (!pci_is_enabled(pdev)) >> + return 0; >> + >> netif_device_detach(netdev); >> >> if (state == pci_channel_io_perm_failure) >> -- >> 2.34.1 >> >
On Fri, Jul 14, 2023 at 01:35:55PM -0700, Vinicius Costa Gomes wrote: > Bjorn Helgaas <helgaas@kernel.org> writes: > > On Fri, Jul 14, 2023 at 01:05:41PM +0800, Kai-Heng Feng wrote: > >> When a system that connects to a Thunderbolt dock equipped with I225, > >> like HP Thunderbolt Dock G4, I225 stops working after S3 resume: > >> ... > > > >> The issue is that the PTM requests are sending before driver resumes the > >> device. Since the issue can also be observed on Windows, it's quite > >> likely a firmware/hardware limitation. > > > > Does this mean we didn't disable PTM correctly on suspend? Or is the > > device defective and sending PTM requests even though PTM is disabled? > > The way I understand the hardware bug, the device is defective, as you > said, the device sends PTM messages when "busmastering" is disabled. Bus Master Enable controls the ability of a Function to issue Memory and I/O Read/Write Requests (PCIe r6.0, sec 7.5.1.1.3). PTM uses Messages, and I don't think they should be affected by Bus Master Enable. I also don't understand the I225 connection. We have these Uncorrected Non-Fatal errors: > >> [ 606.527931] pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0 > >> [ 606.528064] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID) > >> [ 606.528068] pcieport 0000:00:1d.0: device [8086:7ab0] error status/mask=00100000/00004000 > >> [ 606.528072] pcieport 0000:00:1d.0: [20] UnsupReq (First) > >> [ 606.528075] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 0a000052 00000000 00000000 > >> [ 606.528079] pcieport 0000:00:1d.0: AER: Error of this Agent is reported first > >> [ 606.528098] pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID) > >> [ 606.528101] pcieport 0000:04:01.0: device [8086:1136] error status/mask=00300000/00000000 > >> [ 606.528105] pcieport 0000:04:01.0: [20] UnsupReq (First) > >> [ 606.528107] pcieport 0000:04:01.0: [21] ACSViol > >> [ 606.528110] pcieport 0000:04:01.0: AER: TLP Header: 34000000 04000052 00000000 00000000 They are clearly Unsupported Request errors caused by PTM Requests (decoding at https://bugzilla.kernel.org/show_bug.cgi?id=216850#c9), but they were logged by 00:1d.0 and 04:01.0. The hierarchy is this: 00:1d.0 Root Port to [bus 03-6c] 03:00.0 Switch Upstream Port to [bus 04-6c] 04:01.0 Switch Downstream Port to [bus 06-38] 06:00.0 Switch Upstream Port to [bus 07-38] 07:04.0 Switch Downstream Port to [bus 38] 38:00.0 igc I225 NIC If I225 sent a PTM request when it shouldn't have, i.e., when 07:04.0 didn't have PTM enabled, the error would have been logged by 07:04.0. The fact that the errors were logged by 00:1d.0 and 04:01.0 means that they were caused by PTM requests from 03:00.0 and 06:00.0. Bjorn
[+Cc Aaron] On Fri, Jul 14, 2023 at 10:54 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Jul 14, 2023 at 01:05:41PM +0800, Kai-Heng Feng wrote: > > When a system that connects to a Thunderbolt dock equipped with I225, > > like HP Thunderbolt Dock G4, I225 stops working after S3 resume: > > ... > > > The issue is that the PTM requests are sending before driver resumes the > > device. Since the issue can also be observed on Windows, it's quite > > likely a firmware/hardware limitation. > > Does this mean we didn't disable PTM correctly on suspend? Or is the PTM gets disabled correctly during suspend, by commit c01163dbd1b8 ("PCI/PM: Always disable PTM for all devices during suspend"). Before that commit the suspend will fail. > device defective and sending PTM requests even though PTM is disabled? Yes. When S3 resume, I guess the firmware resets the dock and/or I225 so PTM request starts even before the OS is resumed. AFAIK the issue doesn't happen when s2Idle is used. > > If the latter, I vote for a quirk that just disables PTM completely > for this device. The S3 resume enables PTM regardless of OS involvement. So I don't think this will work. > > This check in .error_detected() looks out of place to me because > there's no connection between AER and PTM, there's no connection > between PTM and the device being enabled, and the connection between > the device being enabled and being fully resumed is a little tenuous. True. This patch is just a workaround. Have you considered my other proposed approach? Like disable AER completely during suspend, or even defer the resuming of PCIe services after the entire hierarchy is resumed? > > If we must do it this way, maybe add a comment about *why* we're > checking pci_is_enabled(). Otherwise this will be copied to other > drivers that don't need it. Sure. Kai-Heng > > > So avoid resetting the device if it's not resumed. Once the device is > > fully resumed, the device can work normally. > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=216850 > > Reviewed-by: Guilherme G. Piccoli <gpiccoli@igalia.com> > > Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> > > > > --- > > v2: > > - Fix typo. > > - Mention the product name. > > > > drivers/net/ethernet/intel/igc/igc_main.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c > > index 9f93f0f4f752..8c36bbe5e428 100644 > > --- a/drivers/net/ethernet/intel/igc/igc_main.c > > +++ b/drivers/net/ethernet/intel/igc/igc_main.c > > @@ -7115,6 +7115,9 @@ static pci_ers_result_t igc_io_error_detected(struct pci_dev *pdev, > > struct net_device *netdev = pci_get_drvdata(pdev); > > struct igc_adapter *adapter = netdev_priv(netdev); > > > > + if (!pci_is_enabled(pdev)) > > + return 0; > > + > > netif_device_detach(netdev); > > > > if (state == pci_channel_io_perm_failure) > > -- > > 2.34.1 > >
On Sun, Jul 16, 2023 at 3:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Jul 14, 2023 at 01:35:55PM -0700, Vinicius Costa Gomes wrote: > > Bjorn Helgaas <helgaas@kernel.org> writes: > > > On Fri, Jul 14, 2023 at 01:05:41PM +0800, Kai-Heng Feng wrote: > > >> When a system that connects to a Thunderbolt dock equipped with I225, > > >> like HP Thunderbolt Dock G4, I225 stops working after S3 resume: > > >> ... > > > > > >> The issue is that the PTM requests are sending before driver resumes the > > >> device. Since the issue can also be observed on Windows, it's quite > > >> likely a firmware/hardware limitation. > > > > > > Does this mean we didn't disable PTM correctly on suspend? Or is the > > > device defective and sending PTM requests even though PTM is disabled? > > > > The way I understand the hardware bug, the device is defective, as you > > said, the device sends PTM messages when "busmastering" is disabled. > > Bus Master Enable controls the ability of a Function to issue Memory > and I/O Read/Write Requests (PCIe r6.0, sec 7.5.1.1.3). PTM uses > Messages, and I don't think they should be affected by Bus Master > Enable. > > I also don't understand the I225 connection. We have these > Uncorrected Non-Fatal errors: > > > >> [ 606.527931] pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0 > > >> [ 606.528064] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID) > > >> [ 606.528068] pcieport 0000:00:1d.0: device [8086:7ab0] error status/mask=00100000/00004000 > > >> [ 606.528072] pcieport 0000:00:1d.0: [20] UnsupReq (First) > > >> [ 606.528075] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 0a000052 00000000 00000000 > > >> [ 606.528079] pcieport 0000:00:1d.0: AER: Error of this Agent is reported first > > >> [ 606.528098] pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID) > > >> [ 606.528101] pcieport 0000:04:01.0: device [8086:1136] error status/mask=00300000/00000000 > > >> [ 606.528105] pcieport 0000:04:01.0: [20] UnsupReq (First) > > >> [ 606.528107] pcieport 0000:04:01.0: [21] ACSViol > > >> [ 606.528110] pcieport 0000:04:01.0: AER: TLP Header: 34000000 04000052 00000000 00000000 > > They are clearly Unsupported Request errors caused by PTM Requests > (decoding at https://bugzilla.kernel.org/show_bug.cgi?id=216850#c9), > but they were logged by 00:1d.0 and 04:01.0. > > The hierarchy is this: > > 00:1d.0 Root Port to [bus 03-6c] > 03:00.0 Switch Upstream Port to [bus 04-6c] > 04:01.0 Switch Downstream Port to [bus 06-38] > 06:00.0 Switch Upstream Port to [bus 07-38] > 07:04.0 Switch Downstream Port to [bus 38] > 38:00.0 igc I225 NIC > > If I225 sent a PTM request when it shouldn't have, i.e., when 07:04.0 > didn't have PTM enabled, the error would have been logged by 07:04.0. > > The fact that the errors were logged by 00:1d.0 and 04:01.0 means that > they were caused by PTM requests from 03:00.0 and 06:00.0. OK, so the PTM is actually fired by the Thunderbolt switch. That means the I225 reset is collateral damage. Let me see if I can reproduce the UR PTM with other devices. Kai-Heng > > Bjorn
On Mon, Jul 17, 2023 at 03:38:09PM +0800, Kai-Heng Feng wrote: > On Fri, Jul 14, 2023 at 10:54 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Jul 14, 2023 at 01:05:41PM +0800, Kai-Heng Feng wrote: > > > When a system that connects to a Thunderbolt dock equipped with I225, > > > like HP Thunderbolt Dock G4, I225 stops working after S3 resume: > > > ... > > > > > The issue is that the PTM requests are sending before driver resumes the > > > device. Since the issue can also be observed on Windows, it's quite > > > likely a firmware/hardware limitation. > > > > Does this mean we didn't disable PTM correctly on suspend? Or is the > > PTM gets disabled correctly during suspend, by commit c01163dbd1b8 > ("PCI/PM: Always disable PTM for all devices during suspend"). > Before that commit the suspend will fail. > > > device defective and sending PTM requests even though PTM is disabled? > > Yes. When S3 resume, I guess the firmware resets the dock and/or I225 > so PTM request starts even before the OS is resumed. > AFAIK the issue doesn't happen when s2Idle is used. A reset should disable PTM. I wonder if dock firmware could be enabling PTM? That would seem like a defect to me, because the dock can't know whether the Downstream Port in the laptop has PTM enabled. > > If the latter, I vote for a quirk that just disables PTM completely > > for this device. > > The S3 resume enables PTM regardless of OS involvement. So I don't > think this will work. That's ... weird. So something other than the OS enables PTM? Not sure how we can deal with that. Some (probably most) laptop Root Ports don't support PTM at all, so plugging the dock into one of those would always cause errors. > > This check in .error_detected() looks out of place to me because > > there's no connection between AER and PTM, there's no connection > > between PTM and the device being enabled, and the connection between > > the device being enabled and being fully resumed is a little tenuous. > > True. This patch is just a workaround. > > Have you considered my other proposed approach? Like disable AER > completely during suspend, or even defer the resuming of PCIe services > after the entire hierarchy is resumed? We might need to disable AER in certain cases where errors are unavoidable, like a surprise unplug that causes data link errors when the plug gets yanked out. But this PTM thing is not like that -- this is a configuration error, and I think we should fix the configuration instead of ignoring the error. > > > So avoid resetting the device if it's not resumed. Once the device is > > > fully resumed, the device can work normally. I'm a little confused about how this workaround works. The subject says "ignore AER reset", but here you say *avoid* resetting the device. I think we only reset things in the AER_FATAL case, and the PTM UR is an AER_NONFATAL error, so I'm not sure what the effect of this patch is. Also, igc_io_error_detected() needs to return one of the pci_ers_result values, not zero. Zero isn't a valid return value, so we can't rely on how pcie_do_recovery() handles it. > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=216850 > > > Reviewed-by: Guilherme G. Piccoli <gpiccoli@igalia.com> > > > Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> > > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> > > > > > > --- > > > v2: > > > - Fix typo. > > > - Mention the product name. > > > > > > drivers/net/ethernet/intel/igc/igc_main.c | 3 +++ > > > 1 file changed, 3 insertions(+) > > > > > > diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c > > > index 9f93f0f4f752..8c36bbe5e428 100644 > > > --- a/drivers/net/ethernet/intel/igc/igc_main.c > > > +++ b/drivers/net/ethernet/intel/igc/igc_main.c > > > @@ -7115,6 +7115,9 @@ static pci_ers_result_t igc_io_error_detected(struct pci_dev *pdev, > > > struct net_device *netdev = pci_get_drvdata(pdev); > > > struct igc_adapter *adapter = netdev_priv(netdev); > > > > > > + if (!pci_is_enabled(pdev)) > > > + return 0; > > > + > > > netif_device_detach(netdev); > > > > > > if (state == pci_channel_io_perm_failure) > > > -- > > > 2.34.1 > > > > _______________________________________________ > Intel-wired-lan mailing list > Intel-wired-lan@osuosl.org > https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c index 9f93f0f4f752..8c36bbe5e428 100644 --- a/drivers/net/ethernet/intel/igc/igc_main.c +++ b/drivers/net/ethernet/intel/igc/igc_main.c @@ -7115,6 +7115,9 @@ static pci_ers_result_t igc_io_error_detected(struct pci_dev *pdev, struct net_device *netdev = pci_get_drvdata(pdev); struct igc_adapter *adapter = netdev_priv(netdev); + if (!pci_is_enabled(pdev)) + return 0; + netif_device_detach(netdev); if (state == pci_channel_io_perm_failure)