diff mbox series

[net] tg3: power down device only on SYSTEM_POWER_OFF

Message ID 20231101130418.44164-1-george.shuklin@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net] tg3: power down device only on SYSTEM_POWER_OFF | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1351 this patch: 1351
netdev/cc_maintainers fail 6 maintainers not CCed: pavan.chebbi@broadcom.com mchan@broadcom.com kuba@kernel.org pabeni@redhat.com davem@davemloft.net edumazet@google.com
netdev/build_clang success Errors and warnings before: 1369 this patch: 1369
netdev/verify_signedoff fail author Signed-off-by missing
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1380 this patch: 1380
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 9 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

George Shuklin Nov. 1, 2023, 1:04 p.m. UTC
Dell R650xs servers hangs if tg3 driver calls tg3_power_down.

This happens only if network adapters (BCM5720 for R650xs) were
initialized using SNP (e.g. by booting ipxe.efi).

This is partial revert of commit 2ca1c94ce0b.

The actual problem is on Dell side, but this fix allow servers
to come back alive after reboot.
---
 drivers/net/ethernet/broadcom/tg3.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Pavan Chebbi Nov. 1, 2023, 3:20 p.m. UTC | #1
On Wed, Nov 1, 2023 at 6:34 PM George Shuklin <george.shuklin@gmail.com> wrote:
>
> Dell R650xs servers hangs if tg3 driver calls tg3_power_down.
>
> This happens only if network adapters (BCM5720 for R650xs) were
> initialized using SNP (e.g. by booting ipxe.efi).
>
> This is partial revert of commit 2ca1c94ce0b.
>
> The actual problem is on Dell side, but this fix allow servers
> to come back alive after reboot.

How are you sure that the problem solved by 2ca1c94ce0b is not
reintroduced with this change?

> ---
>  drivers/net/ethernet/broadcom/tg3.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 14b311196b8f..22b00912f7ac 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -18078,7 +18078,8 @@ static void tg3_shutdown(struct pci_dev *pdev)
>         if (netif_running(dev))
>                 dev_close(dev);
>
> -       tg3_power_down(tp);
> +       if (system_state == SYSTEM_POWER_OFF)
> +               tg3_power_down(tp);
>
>         rtnl_unlock();
>
> --
> 2.42.0
>
>
George Shuklin Nov. 1, 2023, 7:58 p.m. UTC | #2
On 01/11/2023 17:20, Pavan Chebbi wrote:
> On Wed, Nov 1, 2023 at 6:34 PM George Shuklin <george.shuklin@gmail.com> wrote:
>> Dell R650xs servers hangs if tg3 driver calls tg3_power_down.
>>
>> This happens only if network adapters (BCM5720 for R650xs) were
>> initialized using SNP (e.g. by booting ipxe.efi).
>>
>> This is partial revert of commit 2ca1c94ce0b.
>>
>> The actual problem is on Dell side, but this fix allow servers
>> to come back alive after reboot.
> How are you sure that the problem solved by 2ca1c94ce0b is not
> reintroduced with this change?

I contacted the author of original patch, no reply yet (1st day). Also, 
I tested it on few generations of available Dell servers (R330, R340, 
R350 and R650sx, for which this fix should help). It does produce log 
message from 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471, but, at 
least, it reboots without issues.

Actually, original patch is regression: 5.19 rebooting just fine, 6.0 
start to hang. I also reported it to dell support forum, but I'm not 
sure if they pick it up or not.

What would be the proper course of actions for such problem (outside of 
fixing UEFI SNP, for which I don't have access to sources)?
Pavan Chebbi Nov. 2, 2023, 7:04 a.m. UTC | #3
On Thu, Nov 2, 2023 at 1:28 AM George Shuklin <george.shuklin@gmail.com> wrote:
>
> On 01/11/2023 17:20, Pavan Chebbi wrote:
> > On Wed, Nov 1, 2023 at 6:34 PM George Shuklin <george.shuklin@gmail.com> wrote:
> >> Dell R650xs servers hangs if tg3 driver calls tg3_power_down.
> >>
> >> This happens only if network adapters (BCM5720 for R650xs) were
> >> initialized using SNP (e.g. by booting ipxe.efi).
> >>
> >> This is partial revert of commit 2ca1c94ce0b.
> >>
> >> The actual problem is on Dell side, but this fix allow servers
> >> to come back alive after reboot.
> > How are you sure that the problem solved by 2ca1c94ce0b is not
> > reintroduced with this change?
>
> I contacted the author of original patch, no reply yet (1st day). Also,
> I tested it on few generations of available Dell servers (R330, R340,
> R350 and R650sx, for which this fix should help). It does produce log
> message from
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471, but, at
> least, it reboots without issues.
>
> Actually, original patch is regression: 5.19 rebooting just fine, 6.0
> start to hang. I also reported it to dell support forum, but I'm not
> sure if they pick it up or not.
>
> What would be the proper course of actions for such problem (outside of
> fixing UEFI SNP, for which I don't have access to sources)?
>
Thanks for the explanation. I am not sure if we should make this
change unless we are 100pc sure that this patch won't cause
regression.
I feel Dell is in the best position to debug this and they can even
contact Broadcom if they see any problem in UEFI.
George Shuklin Nov. 2, 2023, 2:14 p.m. UTC | #4
On 11/2/23 09:04, Pavan Chebbi wrote:
> On Thu, Nov 2, 2023 at 1:28 AM George Shuklin <george.shuklin@gmail.com> wrote:
>> On 01/11/2023 17:20, Pavan Chebbi wrote:
>>> On Wed, Nov 1, 2023 at 6:34 PM George Shuklin <george.shuklin@gmail.com> wrote:
>>>> Dell R650xs servers hangs if tg3 driver calls tg3_power_down.
>>>>
>>>> This happens only if network adapters (BCM5720 for R650xs) were
>>>> initialized using SNP (e.g. by booting ipxe.efi).
>>>>
>>>> This is partial revert of commit 2ca1c94ce0b.
>>>>
>>>> The actual problem is on Dell side, but this fix allow servers
>>>> to come back alive after reboot.
>>> How are you sure that the problem solved by 2ca1c94ce0b is not
>>> reintroduced with this change?
>> I contacted the author of original patch, no reply yet (1st day). Also,
>> I tested it on few generations of available Dell servers (R330, R340,
>> R350 and R650sx, for which this fix should help). It does produce log
>> message from
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471, but, at
>> least, it reboots without issues.
>>
>> Actually, original patch is regression: 5.19 rebooting just fine, 6.0
>> start to hang. I also reported it to dell support forum, but I'm not
>> sure if they pick it up or not.
>>
>> What would be the proper course of actions for such problem (outside of
>> fixing UEFI SNP, for which I don't have access to sources)?
>>
> Thanks for the explanation. I am not sure if we should make this
> change unless we are 100pc sure that this patch won't cause
> regression.
> I feel Dell is in the best position to debug this and they can even
> contact Broadcom if they see any problem in UEFI.

I'm right now with dell support, and what they asked is to 'try this on 
supported distros', which at newest are 5.15. I'll try to bypass their 
L1 with Ubuntu + HWE to get to 6+ versions...

I was able to reproduce hanging at reboot there (without ACPI messages), 
and patching helps there too.
Pavan Chebbi Nov. 2, 2023, 5:11 p.m. UTC | #5
On Thu, Nov 2, 2023 at 7:44 PM George Shuklin <george.shuklin@gmail.com> wrote:
>
> On 11/2/23 09:04, Pavan Chebbi wrote:
> > On Thu, Nov 2, 2023 at 1:28 AM George Shuklin <george.shuklin@gmail.com> wrote:
> >> On 01/11/2023 17:20, Pavan Chebbi wrote:
> >>> On Wed, Nov 1, 2023 at 6:34 PM George Shuklin <george.shuklin@gmail.com> wrote:
> >>>> Dell R650xs servers hangs if tg3 driver calls tg3_power_down.
> >>>>
> >>>> This happens only if network adapters (BCM5720 for R650xs) were
> >>>> initialized using SNP (e.g. by booting ipxe.efi).
> >>>>
> >>>> This is partial revert of commit 2ca1c94ce0b.
> >>>>
> >>>> The actual problem is on Dell side, but this fix allow servers
> >>>> to come back alive after reboot.
> >>> How are you sure that the problem solved by 2ca1c94ce0b is not
> >>> reintroduced with this change?
> >> I contacted the author of original patch, no reply yet (1st day). Also,
> >> I tested it on few generations of available Dell servers (R330, R340,
> >> R350 and R650sx, for which this fix should help). It does produce log
> >> message from
> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471, but, at
> >> least, it reboots without issues.
> >>
> >> Actually, original patch is regression: 5.19 rebooting just fine, 6.0
> >> start to hang. I also reported it to dell support forum, but I'm not
> >> sure if they pick it up or not.
> >>
> >> What would be the proper course of actions for such problem (outside of
> >> fixing UEFI SNP, for which I don't have access to sources)?
> >>
> > Thanks for the explanation. I am not sure if we should make this
> > change unless we are 100pc sure that this patch won't cause
> > regression.
> > I feel Dell is in the best position to debug this and they can even
> > contact Broadcom if they see any problem in UEFI.
>
> I'm right now with dell support, and what they asked is to 'try this on
> supported distros', which at newest are 5.15. I'll try to bypass their
> L1 with Ubuntu + HWE to get to 6+ versions...
>
> I was able to reproduce hanging at reboot there (without ACPI messages),
> and patching helps there too.
>
OK. I am not too sure what we should do. The change as such looks fine to me.
Of course, the patch needs proper tags (tree.fixes, cc ...)

@Michael : Do you have any suggestions on this?
Michael Chan Nov. 2, 2023, 5:49 p.m. UTC | #6
On Thu, Nov 2, 2023 at 10:11 AM Pavan Chebbi <pavan.chebbi@broadcom.com> wrote:
>
> On Thu, Nov 2, 2023 at 7:44 PM George Shuklin <george.shuklin@gmail.com> wrote:
> >
> > I'm right now with dell support, and what they asked is to 'try this on
> > supported distros', which at newest are 5.15. I'll try to bypass their
> > L1 with Ubuntu + HWE to get to 6+ versions...
> >
> > I was able to reproduce hanging at reboot there (without ACPI messages),
> > and patching helps there too.
> >
> OK. I am not too sure what we should do. The change as such looks fine to me.
> Of course, the patch needs proper tags (tree.fixes, cc ...)
>
> @Michael : Do you have any suggestions on this?

I think that this logic:

if (system_state == SYSTEM_POWER_OFF)
        tg3_power_down(tp);

is correct in the shutdown method.  Many other drivers do the same
thing.  tg3_power_down() is just to enable WoL if it should be enabled
and to put the device in D3hot state.

The original commit to change this (2ca1c94ce0b6 tg3: Disable tg3
device on system reboot to avoid triggering AER) claims that calling
tg3_power_down() unconditionally is to avoid getting spurious MSI.
But the dev_close() call in tg3_shutdown() should have shut everything
down including MSI.  So I don't think it is necessary to call
tg3_power_down() unconditionally.
Pavan Chebbi Nov. 3, 2023, 3:48 a.m. UTC | #7
On Thu, Nov 2, 2023 at 11:19 PM Michael Chan <michael.chan@broadcom.com> wrote:
>
> On Thu, Nov 2, 2023 at 10:11 AM Pavan Chebbi <pavan.chebbi@broadcom.com> wrote:
> >
> > On Thu, Nov 2, 2023 at 7:44 PM George Shuklin <george.shuklin@gmail.com> wrote:
> > >
> > > I'm right now with dell support, and what they asked is to 'try this on
> > > supported distros', which at newest are 5.15. I'll try to bypass their
> > > L1 with Ubuntu + HWE to get to 6+ versions...
> > >
> > > I was able to reproduce hanging at reboot there (without ACPI messages),
> > > and patching helps there too.
> > >
> > OK. I am not too sure what we should do. The change as such looks fine to me.
> > Of course, the patch needs proper tags (tree.fixes, cc ...)
> >
> > @Michael : Do you have any suggestions on this?
>
> I think that this logic:
>
> if (system_state == SYSTEM_POWER_OFF)
>         tg3_power_down(tp);
>
> is correct in the shutdown method.  Many other drivers do the same
> thing.  tg3_power_down() is just to enable WoL if it should be enabled
> and to put the device in D3hot state.
>
> The original commit to change this (2ca1c94ce0b6 tg3: Disable tg3
> device on system reboot to avoid triggering AER) claims that calling
> tg3_power_down() unconditionally is to avoid getting spurious MSI.
> But the dev_close() call in tg3_shutdown() should have shut everything
> down including MSI.  So I don't think it is necessary to call
> tg3_power_down() unconditionally.

Agree. So let us have this change in. George, can you pls spin a v2
with proper tagging/addressing done to the patch.
Please see https://docs.kernel.org/process/submitting-patches.html for
help. Thanks.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 14b311196b8f..22b00912f7ac 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -18078,7 +18078,8 @@  static void tg3_shutdown(struct pci_dev *pdev)
 	if (netif_running(dev))
 		dev_close(dev);
 
-	tg3_power_down(tp);
+	if (system_state == SYSTEM_POWER_OFF)
+		tg3_power_down(tp);
 
 	rtnl_unlock();