diff mbox series

[v2,net] qede: fix firmware halt over suspend and resume

Message ID 20230809134339.698074-1-manishc@marvell.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [v2,net] qede: fix firmware halt over suspend and resume | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1328 this patch: 1328
netdev/cc_maintainers fail 2 blamed authors not CCed: Yuval.Mintz@qlogic.com Sudarsana.Kalluru@qlogic.com; 2 maintainers not CCed: Yuval.Mintz@qlogic.com Sudarsana.Kalluru@qlogic.com
netdev/build_clang success Errors and warnings before: 1351 this patch: 1351
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1351 this patch: 1351
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 25 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Manish Chopra Aug. 9, 2023, 1:43 p.m. UTC
While performing certain power-off sequences, PCI drivers are
called to suspend and resume their underlying devices through
PCI PM (power management) interface. However this NIC hardware
does not support PCI PM suspend/resume operations so system wide
suspend/resume leads to bad MFW (management firmware) state which
causes various follow-up errors in driver when communicating with
the device/firmware afterwards.

To fix this driver implements PCI PM suspend handler to indicate
unsupported operation to the PCI subsystem explicitly, thus avoiding
system to go into suspended/standby mode.

Fixes: 2950219d87b0 ("qede: Add basic network device support")
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Alok Prasad <palok@marvell.com>
---
V1->V2:
* Replace SIMPLE_DEV_PM_OPS with DEFINE_SIMPLE_DEV_PM_OPS
---
 drivers/net/ethernet/qlogic/qede/qede_main.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

Simon Horman Aug. 10, 2023, 6:02 p.m. UTC | #1
On Wed, Aug 09, 2023 at 07:13:39PM +0530, Manish Chopra wrote:
> While performing certain power-off sequences, PCI drivers are
> called to suspend and resume their underlying devices through
> PCI PM (power management) interface. However this NIC hardware
> does not support PCI PM suspend/resume operations so system wide
> suspend/resume leads to bad MFW (management firmware) state which
> causes various follow-up errors in driver when communicating with
> the device/firmware afterwards.
> 
> To fix this driver implements PCI PM suspend handler to indicate
> unsupported operation to the PCI subsystem explicitly, thus avoiding
> system to go into suspended/standby mode.
> 
> Fixes: 2950219d87b0 ("qede: Add basic network device support")
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Manish Chopra <manishc@marvell.com>
> Signed-off-by: Alok Prasad <palok@marvell.com>
> ---
> V1->V2:
> * Replace SIMPLE_DEV_PM_OPS with DEFINE_SIMPLE_DEV_PM_OPS

Thanks!

Reviewed-by: Simon Horman <horms@kernel.org>
Jakub Kicinski Aug. 11, 2023, 12:47 a.m. UTC | #2
On Wed, 9 Aug 2023 19:13:39 +0530 Manish Chopra wrote:
> While performing certain power-off sequences, PCI drivers are
> called to suspend and resume their underlying devices through
> PCI PM (power management) interface. However this NIC hardware
> does not support PCI PM suspend/resume operations so system wide
> suspend/resume leads to bad MFW (management firmware) state which
> causes various follow-up errors in driver when communicating with
> the device/firmware afterwards.

Does the FW end up recovering? That could still be preferable
to rejecting suspend altogether. Reject is a big hammer,
I'm a bit worried it will cause a regression in stable.

> To fix this driver implements PCI PM suspend handler to indicate
> unsupported operation to the PCI subsystem explicitly, thus avoiding
> system to go into suspended/standby mode.
> 
> Fixes: 2950219d87b0 ("qede: Add basic network device support")
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Manish Chopra <manishc@marvell.com>
> Signed-off-by: Alok Prasad <palok@marvell.com>
> ---
> V1->V2:
> * Replace SIMPLE_DEV_PM_OPS with DEFINE_SIMPLE_DEV_PM_OPS
> ---
>  drivers/net/ethernet/qlogic/qede/qede_main.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
> index d57e52a97f85..18ae7af1764c 100644
> --- a/drivers/net/ethernet/qlogic/qede/qede_main.c
> +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
> @@ -177,6 +177,18 @@ static int qede_sriov_configure(struct pci_dev *pdev, int num_vfs_param)
>  }
>  #endif
>  
> +static int __maybe_unused qede_suspend(struct device *dev)
> +{
> +	if (!dev)
> +		return -ENODEV;

Can dev really be NULL here? That wouldn't make sense, what's the
driver supposed to do in such case?
Manish Chopra Aug. 11, 2023, 9:31 a.m. UTC | #3
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, August 11, 2023 6:17 AM
> To: Manish Chopra <manishc@marvell.com>
> Cc: netdev@vger.kernel.org; Ariel Elior <aelior@marvell.com>; Alok Prasad
> <palok@marvell.com>; Nilesh Javali <njavali@marvell.com>; Saurav Kashyap
> <skashyap@marvell.com>; jmeneghi@redhat.com; yuval.mintz@qlogic.com;
> Sudarsana Reddy Kalluru <skalluru@marvell.com>; pabeni@redhat.com;
> edumazet@google.com; horms@kernel.org; David Miller
> <davem@davemloft.net>
> Subject: [EXT] Re: [PATCH v2 net] qede: fix firmware halt over suspend and
> resume
> 
> External Email
> 
> ----------------------------------------------------------------------
> On Wed, 9 Aug 2023 19:13:39 +0530 Manish Chopra wrote:
> > While performing certain power-off sequences, PCI drivers are called
> > to suspend and resume their underlying devices through PCI PM (power
> > management) interface. However this NIC hardware does not support PCI
> > PM suspend/resume operations so system wide suspend/resume leads to
> > bad MFW (management firmware) state which causes various follow-up
> > errors in driver when communicating with the device/firmware
> > afterwards.
> 
> Does the FW end up recovering? That could still be preferable to rejecting
> suspend altogether. Reject is a big hammer, I'm a bit worried it will cause a
> regression in stable.

Yes, By adding the driver's suspend handler with explicit error returned 
to PCI subsystem prevents the system wide suspend and does not impact the
device/FW at all. It keeps them operational as they were before.

> 
> > To fix this driver implements PCI PM suspend handler to indicate
> > unsupported operation to the PCI subsystem explicitly, thus avoiding
> > system to go into suspended/standby mode.
> >
> > Fixes: 2950219d87b0 ("qede: Add basic network device support")
> > Cc: David Miller <davem@davemloft.net>
> > Signed-off-by: Manish Chopra <manishc@marvell.com>
> > Signed-off-by: Alok Prasad <palok@marvell.com>
> > ---
> > V1->V2:
> > * Replace SIMPLE_DEV_PM_OPS with DEFINE_SIMPLE_DEV_PM_OPS
> > ---
> >  drivers/net/ethernet/qlogic/qede/qede_main.c | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c
> > b/drivers/net/ethernet/qlogic/qede/qede_main.c
> > index d57e52a97f85..18ae7af1764c 100644
> > --- a/drivers/net/ethernet/qlogic/qede/qede_main.c
> > +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
> > @@ -177,6 +177,18 @@ static int qede_sriov_configure(struct pci_dev
> > *pdev, int num_vfs_param)  }  #endif
> >
> > +static int __maybe_unused qede_suspend(struct device *dev) {
> > +	if (!dev)
> > +		return -ENODEV;
> 
> Can dev really be NULL here? That wouldn't make sense, what's the driver
> supposed to do in such case?

It's not supposed to be NULL here assuming caller must be validating it way
before. I just put it for sanity. I will remove it.

> --
> pw-bot: cr
Jakub Kicinski Aug. 11, 2023, 9:45 p.m. UTC | #4
On Fri, 11 Aug 2023 09:31:15 +0000 Manish Chopra wrote:
> > Does the FW end up recovering? That could still be preferable to rejecting
> > suspend altogether. Reject is a big hammer, I'm a bit worried it will cause a
> > regression in stable.  
> 
> Yes, By adding the driver's suspend handler with explicit error returned 
> to PCI subsystem prevents the system wide suspend and does not impact the
> device/FW at all. It keeps them operational as they were before.

I'm asking about recovery without this patch, not with it.
That should be evident from the text I'm replying under.
Manish Chopra Aug. 14, 2023, 10:24 a.m. UTC | #5
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Saturday, August 12, 2023 3:15 AM
> To: Manish Chopra <manishc@marvell.com>
> Cc: netdev@vger.kernel.org; Ariel Elior <aelior@marvell.com>; Alok Prasad
> <palok@marvell.com>; Nilesh Javali <njavali@marvell.com>; Saurav Kashyap
> <skashyap@marvell.com>; jmeneghi@redhat.com; yuval.mintz@qlogic.com;
> Sudarsana Reddy Kalluru <skalluru@marvell.com>; pabeni@redhat.com;
> edumazet@google.com; horms@kernel.org; David Miller
> <davem@davemloft.net>
> Subject: Re: [EXT] Re: [PATCH v2 net] qede: fix firmware halt over suspend and
> resume
> 
> On Fri, 11 Aug 2023 09:31:15 +0000 Manish Chopra wrote:
> > > Does the FW end up recovering? That could still be preferable to
> > > rejecting suspend altogether. Reject is a big hammer, I'm a bit
> > > worried it will cause a regression in stable.
> >
> > Yes, By adding the driver's suspend handler with explicit error
> > returned to PCI subsystem prevents the system wide suspend and does
> > not impact the device/FW at all. It keeps them operational as they were
> before.
> 
> I'm asking about recovery without this patch, not with it.
> That should be evident from the text I'm replying under.

Nope, It does not recover. We have to power cycle the system to recover.
Jakub Kicinski Aug. 14, 2023, 3:17 p.m. UTC | #6
On Mon, 14 Aug 2023 10:24:52 +0000 Manish Chopra wrote:
> > I'm asking about recovery without this patch, not with it.
> > That should be evident from the text I'm replying under.  
> 
> Nope, It does not recover. We have to power cycle the system to recover.

Alright, please state that in the commit message and drop the
unnecessary NULL check for v2.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index d57e52a97f85..18ae7af1764c 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -177,6 +177,18 @@  static int qede_sriov_configure(struct pci_dev *pdev, int num_vfs_param)
 }
 #endif
 
+static int __maybe_unused qede_suspend(struct device *dev)
+{
+	if (!dev)
+		return -ENODEV;
+
+	dev_info(dev, "Device does not support suspend operation\n");
+
+	return -EOPNOTSUPP;
+}
+
+static DEFINE_SIMPLE_DEV_PM_OPS(qede_pm_ops, qede_suspend, NULL);
+
 static const struct pci_error_handlers qede_err_handler = {
 	.error_detected = qede_io_error_detected,
 };
@@ -191,6 +203,7 @@  static struct pci_driver qede_pci_driver = {
 	.sriov_configure = qede_sriov_configure,
 #endif
 	.err_handler = &qede_err_handler,
+	.driver.pm = &qede_pm_ops,
 };
 
 static struct qed_eth_cb_ops qede_ll_ops = {