Message ID | 20231101130418.44164-1-george.shuklin@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] tg3: power down device only on SYSTEM_POWER_OFF | expand |
On Wed, Nov 1, 2023 at 6:34 PM George Shuklin <george.shuklin@gmail.com> wrote: > > Dell R650xs servers hangs if tg3 driver calls tg3_power_down. > > This happens only if network adapters (BCM5720 for R650xs) were > initialized using SNP (e.g. by booting ipxe.efi). > > This is partial revert of commit 2ca1c94ce0b. > > The actual problem is on Dell side, but this fix allow servers > to come back alive after reboot. How are you sure that the problem solved by 2ca1c94ce0b is not reintroduced with this change? > --- > drivers/net/ethernet/broadcom/tg3.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c > index 14b311196b8f..22b00912f7ac 100644 > --- a/drivers/net/ethernet/broadcom/tg3.c > +++ b/drivers/net/ethernet/broadcom/tg3.c > @@ -18078,7 +18078,8 @@ static void tg3_shutdown(struct pci_dev *pdev) > if (netif_running(dev)) > dev_close(dev); > > - tg3_power_down(tp); > + if (system_state == SYSTEM_POWER_OFF) > + tg3_power_down(tp); > > rtnl_unlock(); > > -- > 2.42.0 > >
On 01/11/2023 17:20, Pavan Chebbi wrote: > On Wed, Nov 1, 2023 at 6:34 PM George Shuklin <george.shuklin@gmail.com> wrote: >> Dell R650xs servers hangs if tg3 driver calls tg3_power_down. >> >> This happens only if network adapters (BCM5720 for R650xs) were >> initialized using SNP (e.g. by booting ipxe.efi). >> >> This is partial revert of commit 2ca1c94ce0b. >> >> The actual problem is on Dell side, but this fix allow servers >> to come back alive after reboot. > How are you sure that the problem solved by 2ca1c94ce0b is not > reintroduced with this change? I contacted the author of original patch, no reply yet (1st day). Also, I tested it on few generations of available Dell servers (R330, R340, R350 and R650sx, for which this fix should help). It does produce log message from https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471, but, at least, it reboots without issues. Actually, original patch is regression: 5.19 rebooting just fine, 6.0 start to hang. I also reported it to dell support forum, but I'm not sure if they pick it up or not. What would be the proper course of actions for such problem (outside of fixing UEFI SNP, for which I don't have access to sources)?
On Thu, Nov 2, 2023 at 1:28 AM George Shuklin <george.shuklin@gmail.com> wrote: > > On 01/11/2023 17:20, Pavan Chebbi wrote: > > On Wed, Nov 1, 2023 at 6:34 PM George Shuklin <george.shuklin@gmail.com> wrote: > >> Dell R650xs servers hangs if tg3 driver calls tg3_power_down. > >> > >> This happens only if network adapters (BCM5720 for R650xs) were > >> initialized using SNP (e.g. by booting ipxe.efi). > >> > >> This is partial revert of commit 2ca1c94ce0b. > >> > >> The actual problem is on Dell side, but this fix allow servers > >> to come back alive after reboot. > > How are you sure that the problem solved by 2ca1c94ce0b is not > > reintroduced with this change? > > I contacted the author of original patch, no reply yet (1st day). Also, > I tested it on few generations of available Dell servers (R330, R340, > R350 and R650sx, for which this fix should help). It does produce log > message from > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471, but, at > least, it reboots without issues. > > Actually, original patch is regression: 5.19 rebooting just fine, 6.0 > start to hang. I also reported it to dell support forum, but I'm not > sure if they pick it up or not. > > What would be the proper course of actions for such problem (outside of > fixing UEFI SNP, for which I don't have access to sources)? > Thanks for the explanation. I am not sure if we should make this change unless we are 100pc sure that this patch won't cause regression. I feel Dell is in the best position to debug this and they can even contact Broadcom if they see any problem in UEFI.
On 11/2/23 09:04, Pavan Chebbi wrote: > On Thu, Nov 2, 2023 at 1:28 AM George Shuklin <george.shuklin@gmail.com> wrote: >> On 01/11/2023 17:20, Pavan Chebbi wrote: >>> On Wed, Nov 1, 2023 at 6:34 PM George Shuklin <george.shuklin@gmail.com> wrote: >>>> Dell R650xs servers hangs if tg3 driver calls tg3_power_down. >>>> >>>> This happens only if network adapters (BCM5720 for R650xs) were >>>> initialized using SNP (e.g. by booting ipxe.efi). >>>> >>>> This is partial revert of commit 2ca1c94ce0b. >>>> >>>> The actual problem is on Dell side, but this fix allow servers >>>> to come back alive after reboot. >>> How are you sure that the problem solved by 2ca1c94ce0b is not >>> reintroduced with this change? >> I contacted the author of original patch, no reply yet (1st day). Also, >> I tested it on few generations of available Dell servers (R330, R340, >> R350 and R650sx, for which this fix should help). It does produce log >> message from >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471, but, at >> least, it reboots without issues. >> >> Actually, original patch is regression: 5.19 rebooting just fine, 6.0 >> start to hang. I also reported it to dell support forum, but I'm not >> sure if they pick it up or not. >> >> What would be the proper course of actions for such problem (outside of >> fixing UEFI SNP, for which I don't have access to sources)? >> > Thanks for the explanation. I am not sure if we should make this > change unless we are 100pc sure that this patch won't cause > regression. > I feel Dell is in the best position to debug this and they can even > contact Broadcom if they see any problem in UEFI. I'm right now with dell support, and what they asked is to 'try this on supported distros', which at newest are 5.15. I'll try to bypass their L1 with Ubuntu + HWE to get to 6+ versions... I was able to reproduce hanging at reboot there (without ACPI messages), and patching helps there too.
On Thu, Nov 2, 2023 at 7:44 PM George Shuklin <george.shuklin@gmail.com> wrote: > > On 11/2/23 09:04, Pavan Chebbi wrote: > > On Thu, Nov 2, 2023 at 1:28 AM George Shuklin <george.shuklin@gmail.com> wrote: > >> On 01/11/2023 17:20, Pavan Chebbi wrote: > >>> On Wed, Nov 1, 2023 at 6:34 PM George Shuklin <george.shuklin@gmail.com> wrote: > >>>> Dell R650xs servers hangs if tg3 driver calls tg3_power_down. > >>>> > >>>> This happens only if network adapters (BCM5720 for R650xs) were > >>>> initialized using SNP (e.g. by booting ipxe.efi). > >>>> > >>>> This is partial revert of commit 2ca1c94ce0b. > >>>> > >>>> The actual problem is on Dell side, but this fix allow servers > >>>> to come back alive after reboot. > >>> How are you sure that the problem solved by 2ca1c94ce0b is not > >>> reintroduced with this change? > >> I contacted the author of original patch, no reply yet (1st day). Also, > >> I tested it on few generations of available Dell servers (R330, R340, > >> R350 and R650sx, for which this fix should help). It does produce log > >> message from > >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471, but, at > >> least, it reboots without issues. > >> > >> Actually, original patch is regression: 5.19 rebooting just fine, 6.0 > >> start to hang. I also reported it to dell support forum, but I'm not > >> sure if they pick it up or not. > >> > >> What would be the proper course of actions for such problem (outside of > >> fixing UEFI SNP, for which I don't have access to sources)? > >> > > Thanks for the explanation. I am not sure if we should make this > > change unless we are 100pc sure that this patch won't cause > > regression. > > I feel Dell is in the best position to debug this and they can even > > contact Broadcom if they see any problem in UEFI. > > I'm right now with dell support, and what they asked is to 'try this on > supported distros', which at newest are 5.15. I'll try to bypass their > L1 with Ubuntu + HWE to get to 6+ versions... > > I was able to reproduce hanging at reboot there (without ACPI messages), > and patching helps there too. > OK. I am not too sure what we should do. The change as such looks fine to me. Of course, the patch needs proper tags (tree.fixes, cc ...) @Michael : Do you have any suggestions on this?
On Thu, Nov 2, 2023 at 10:11 AM Pavan Chebbi <pavan.chebbi@broadcom.com> wrote: > > On Thu, Nov 2, 2023 at 7:44 PM George Shuklin <george.shuklin@gmail.com> wrote: > > > > I'm right now with dell support, and what they asked is to 'try this on > > supported distros', which at newest are 5.15. I'll try to bypass their > > L1 with Ubuntu + HWE to get to 6+ versions... > > > > I was able to reproduce hanging at reboot there (without ACPI messages), > > and patching helps there too. > > > OK. I am not too sure what we should do. The change as such looks fine to me. > Of course, the patch needs proper tags (tree.fixes, cc ...) > > @Michael : Do you have any suggestions on this? I think that this logic: if (system_state == SYSTEM_POWER_OFF) tg3_power_down(tp); is correct in the shutdown method. Many other drivers do the same thing. tg3_power_down() is just to enable WoL if it should be enabled and to put the device in D3hot state. The original commit to change this (2ca1c94ce0b6 tg3: Disable tg3 device on system reboot to avoid triggering AER) claims that calling tg3_power_down() unconditionally is to avoid getting spurious MSI. But the dev_close() call in tg3_shutdown() should have shut everything down including MSI. So I don't think it is necessary to call tg3_power_down() unconditionally.
On Thu, Nov 2, 2023 at 11:19 PM Michael Chan <michael.chan@broadcom.com> wrote: > > On Thu, Nov 2, 2023 at 10:11 AM Pavan Chebbi <pavan.chebbi@broadcom.com> wrote: > > > > On Thu, Nov 2, 2023 at 7:44 PM George Shuklin <george.shuklin@gmail.com> wrote: > > > > > > I'm right now with dell support, and what they asked is to 'try this on > > > supported distros', which at newest are 5.15. I'll try to bypass their > > > L1 with Ubuntu + HWE to get to 6+ versions... > > > > > > I was able to reproduce hanging at reboot there (without ACPI messages), > > > and patching helps there too. > > > > > OK. I am not too sure what we should do. The change as such looks fine to me. > > Of course, the patch needs proper tags (tree.fixes, cc ...) > > > > @Michael : Do you have any suggestions on this? > > I think that this logic: > > if (system_state == SYSTEM_POWER_OFF) > tg3_power_down(tp); > > is correct in the shutdown method. Many other drivers do the same > thing. tg3_power_down() is just to enable WoL if it should be enabled > and to put the device in D3hot state. > > The original commit to change this (2ca1c94ce0b6 tg3: Disable tg3 > device on system reboot to avoid triggering AER) claims that calling > tg3_power_down() unconditionally is to avoid getting spurious MSI. > But the dev_close() call in tg3_shutdown() should have shut everything > down including MSI. So I don't think it is necessary to call > tg3_power_down() unconditionally. Agree. So let us have this change in. George, can you pls spin a v2 with proper tagging/addressing done to the patch. Please see https://docs.kernel.org/process/submitting-patches.html for help. Thanks.
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c index 14b311196b8f..22b00912f7ac 100644 --- a/drivers/net/ethernet/broadcom/tg3.c +++ b/drivers/net/ethernet/broadcom/tg3.c @@ -18078,7 +18078,8 @@ static void tg3_shutdown(struct pci_dev *pdev) if (netif_running(dev)) dev_close(dev); - tg3_power_down(tp); + if (system_state == SYSTEM_POWER_OFF) + tg3_power_down(tp); rtnl_unlock();