diff mbox series

[net-next,resubmit,v2] r8169: disable ASPM in case of tx timeout

Message ID 92369a92-dc32-4529-0509-11459ba0e391@gmail.com (mailing list archive)
State Accepted
Commit 80c0576ef179311f624bc450fede30a89afe9792
Delegated to: Netdev Maintainers
Headers show
Series [net-next,resubmit,v2] r8169: disable ASPM in case of tx timeout | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length of 84 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Heiner Kallweit Jan. 10, 2023, 10:03 p.m. UTC
There are still single reports of systems where ASPM incompatibilities
cause tx timeouts. It's not clear whom to blame, so let's disable
ASPM in case of a tx timeout.

v2:
- add one-time warning for informing the user

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169_main.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Alexander Duyck Jan. 11, 2023, 4:16 p.m. UTC | #1
On Tue, 2023-01-10 at 23:03 +0100, Heiner Kallweit wrote:
> There are still single reports of systems where ASPM incompatibilities
> cause tx timeouts. It's not clear whom to blame, so let's disable
> ASPM in case of a tx timeout.
> 
> v2:
> - add one-time warning for informing the user
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

From past experience I have seen ASPM issues cause the device to
disappear from the bus after failing to come out of L1. If that occurs
this won't be able to recover after the timeout without resetting the
bus itself. As such it may be necessary to disable the link states
prior to using the device rather than waiting until after the error.
That can be addressed in a follow-on patch if this doesn't resolve the
issue.

As for the code it looks fine to me.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Heiner Kallweit Jan. 11, 2023, 8:17 p.m. UTC | #2
On 11.01.2023 17:16, Alexander H Duyck wrote:
> On Tue, 2023-01-10 at 23:03 +0100, Heiner Kallweit wrote:
>> There are still single reports of systems where ASPM incompatibilities
>> cause tx timeouts. It's not clear whom to blame, so let's disable
>> ASPM in case of a tx timeout.
>>
>> v2:
>> - add one-time warning for informing the user
>>
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> 
>>From past experience I have seen ASPM issues cause the device to
> disappear from the bus after failing to come out of L1. If that occurs
> this won't be able to recover after the timeout without resetting the
> bus itself. As such it may be necessary to disable the link states
> prior to using the device rather than waiting until after the error.
> That can be addressed in a follow-on patch if this doesn't resolve the
> issue.
> 

Interesting, reports about disappearing devices I haven't seen yet.
Symptoms I've seen differ, based on combination of more or less faulty
NIC chipset version, BIOS bugs, PCIe mainboard chipset.
Typically users experienced missed rx packets, tx timeouts or NIC lockups.
Disabling ASPM resulted in complaints of notebook users about reduced
system runtime on battery.
Meanwhile we found a good balance and reports about ASPM issues
became quite rare.
Just L1.2 still causes issues under load even with newer chipset versions,
therefore L1.2 is disabled per default.

> As for the code it looks fine to me.
> 
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Alexander Duyck Jan. 11, 2023, 10:38 p.m. UTC | #3
On Wed, Jan 11, 2023 at 12:17 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 11.01.2023 17:16, Alexander H Duyck wrote:
> > On Tue, 2023-01-10 at 23:03 +0100, Heiner Kallweit wrote:
> >> There are still single reports of systems where ASPM incompatibilities
> >> cause tx timeouts. It's not clear whom to blame, so let's disable
> >> ASPM in case of a tx timeout.
> >>
> >> v2:
> >> - add one-time warning for informing the user
> >>
> >> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> >
> >>From past experience I have seen ASPM issues cause the device to
> > disappear from the bus after failing to come out of L1. If that occurs
> > this won't be able to recover after the timeout without resetting the
> > bus itself. As such it may be necessary to disable the link states
> > prior to using the device rather than waiting until after the error.
> > That can be addressed in a follow-on patch if this doesn't resolve the
> > issue.
> >
>
> Interesting, reports about disappearing devices I haven't seen yet.
> Symptoms I've seen differ, based on combination of more or less faulty
> NIC chipset version, BIOS bugs, PCIe mainboard chipset.
> Typically users experienced missed rx packets, tx timeouts or NIC lockups.
> Disabling ASPM resulted in complaints of notebook users about reduced
> system runtime on battery.
> Meanwhile we found a good balance and reports about ASPM issues
> became quite rare.
> Just L1.2 still causes issues under load even with newer chipset versions,
> therefore L1.2 is disabled per default.

Does your driver do any checking for MMIO failures on reads? Basically
when the device disappears it should start returning ~0 on mmio reads.
The device itself doesn't disappear, but it doesn't respond to
requests anymore so it might be the "NIC lockups" case you mentioned.
The Intel parts would disappear as they would trigger their "surprise
removal" logic which would detach the netdevice. I have seen that
issue on some platforms. It is kind of interesting when you can
actually watch it happen as the issue was essentially a marginal PCIe
connection so it would start out at x4, then renegotiate down with
each ASPM L1 link bounce, and eventually it would end up at x1 before
just dropping off the bus.

I agree pro-actively disabling ASPM is bad for power savings. So if
this approach can resolve it then I am more than willing to give it a
try. My main concern is if MMIO is already borked, updating the ASPM
settings may not be enough to bring it back and it may require a
secondary bus reset.
Heiner Kallweit Jan. 11, 2023, 10:57 p.m. UTC | #4
On 11.01.2023 23:38, Alexander Duyck wrote:
> On Wed, Jan 11, 2023 at 12:17 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> On 11.01.2023 17:16, Alexander H Duyck wrote:
>>> On Tue, 2023-01-10 at 23:03 +0100, Heiner Kallweit wrote:
>>>> There are still single reports of systems where ASPM incompatibilities
>>>> cause tx timeouts. It's not clear whom to blame, so let's disable
>>>> ASPM in case of a tx timeout.
>>>>
>>>> v2:
>>>> - add one-time warning for informing the user
>>>>
>>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>>>
>>> >From past experience I have seen ASPM issues cause the device to
>>> disappear from the bus after failing to come out of L1. If that occurs
>>> this won't be able to recover after the timeout without resetting the
>>> bus itself. As such it may be necessary to disable the link states
>>> prior to using the device rather than waiting until after the error.
>>> That can be addressed in a follow-on patch if this doesn't resolve the
>>> issue.
>>>
>>
>> Interesting, reports about disappearing devices I haven't seen yet.
>> Symptoms I've seen differ, based on combination of more or less faulty
>> NIC chipset version, BIOS bugs, PCIe mainboard chipset.
>> Typically users experienced missed rx packets, tx timeouts or NIC lockups.
>> Disabling ASPM resulted in complaints of notebook users about reduced
>> system runtime on battery.
>> Meanwhile we found a good balance and reports about ASPM issues
>> became quite rare.
>> Just L1.2 still causes issues under load even with newer chipset versions,
>> therefore L1.2 is disabled per default.
> 
> Does your driver do any checking for MMIO failures on reads? Basically
> when the device disappears it should start returning ~0 on mmio reads.

Not yet, good idea.

> The device itself doesn't disappear, but it doesn't respond to
> requests anymore so it might be the "NIC lockups" case you mentioned.
> The Intel parts would disappear as they would trigger their "surprise
> removal" logic which would detach the netdevice. I have seen that
> issue on some platforms. It is kind of interesting when you can
> actually watch it happen as the issue was essentially a marginal PCIe
> connection so it would start out at x4, then renegotiate down with
> each ASPM L1 link bounce, and eventually it would end up at x1 before
> just dropping off the bus.
> 
> I agree pro-actively disabling ASPM is bad for power savings. So if
> this approach can resolve it then I am more than willing to give it a
> try. My main concern is if MMIO is already borked, updating the ASPM
> settings may not be enough to bring it back and it may require a
> secondary bus reset.

I'll think about how to further improve recovery in case of ASPM issues.
patchwork-bot+netdevbpf@kernel.org Jan. 12, 2023, 4:20 a.m. UTC | #5
Hello:

This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 10 Jan 2023 23:03:18 +0100 you wrote:
> There are still single reports of systems where ASPM incompatibilities
> cause tx timeouts. It's not clear whom to blame, so let's disable
> ASPM in case of a tx timeout.
> 
> v2:
> - add one-time warning for informing the user
> 
> [...]

Here is the summary with links:
  - [net-next,resubmit,v2] r8169: disable ASPM in case of tx timeout
    https://git.kernel.org/netdev/net-next/c/80c0576ef179

You are awesome, thank you!
diff mbox series

Patch

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index a9dcc98b6..49c124d8e 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -576,6 +576,7 @@  struct rtl8169_tc_offsets {
 enum rtl_flag {
 	RTL_FLAG_TASK_ENABLED = 0,
 	RTL_FLAG_TASK_RESET_PENDING,
+	RTL_FLAG_TASK_TX_TIMEOUT,
 	RTL_FLAG_MAX
 };
 
@@ -3931,7 +3932,7 @@  static void rtl8169_tx_timeout(struct net_device *dev, unsigned int txqueue)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 
-	rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
+	rtl_schedule_task(tp, RTL_FLAG_TASK_TX_TIMEOUT);
 }
 
 static int rtl8169_tx_map(struct rtl8169_private *tp, const u32 *opts, u32 len,
@@ -4525,6 +4526,7 @@  static void rtl_task(struct work_struct *work)
 {
 	struct rtl8169_private *tp =
 		container_of(work, struct rtl8169_private, wk.work);
+	int ret;
 
 	rtnl_lock();
 
@@ -4532,7 +4534,17 @@  static void rtl_task(struct work_struct *work)
 	    !test_bit(RTL_FLAG_TASK_ENABLED, tp->wk.flags))
 		goto out_unlock;
 
+	if (test_and_clear_bit(RTL_FLAG_TASK_TX_TIMEOUT, tp->wk.flags)) {
+		/* ASPM compatibility issues are a typical reason for tx timeouts */
+		ret = pci_disable_link_state(tp->pci_dev, PCIE_LINK_STATE_L1 |
+							  PCIE_LINK_STATE_L0S);
+		if (!ret)
+			netdev_warn_once(tp->dev, "ASPM disabled on Tx timeout\n");
+		goto reset;
+	}
+
 	if (test_and_clear_bit(RTL_FLAG_TASK_RESET_PENDING, tp->wk.flags)) {
+reset:
 		rtl_reset_work(tp);
 		netif_wake_queue(tp->dev);
 	}