diff mbox series

[iwl-net,v3] ice: reset first in crash dump kernels

Message ID 20231004170214.474792-1-jesse.brandeburg@intel.com (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers
Headers show
Series [iwl-net,v3] ice: reset first in crash dump kernels | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1340 this patch: 1340
netdev/cc_maintainers fail 1 blamed authors not CCed: anirudh.venkataramanan@intel.com; 6 maintainers not CCed: pabeni@redhat.com davem@davemloft.net edumazet@google.com anthony.l.nguyen@intel.com anirudh.venkataramanan@intel.com kuba@kernel.org
netdev/build_clang success Errors and warnings before: 1363 this patch: 1363
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes fail Problems with Fixes tag: 1
netdev/build_allmodconfig_warn success Errors and warnings before: 1363 this patch: 1363
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 27 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jesse Brandeburg Oct. 4, 2023, 5:02 p.m. UTC
When the system boots into the crash dump kernel after a panic, the ice
networking device may still have pending transactions that can cause errors
or machine checks when the device is re-enabled. This can prevent the crash
dump kernel from loading the driver or collecting the crash data.

To avoid this issue, perform a function level reset (FLR) on the ice device
via PCIe config space before enabling it on the crash kernel. This will
clear any outstanding transactions and stop all queues and interrupts.
Restore the config space after the FLR, otherwise it was found in testing
that the driver wouldn't load successfully.

The following sequence causes the original issue:
- Load the ice driver with modprobe ice
- Enable SR-IOV with 2 VFs: echo 2 > /sys/class/net/eth0/device/sriov_num_vfs
- Trigger a crash with echo c > /proc/sysrq-trigger
- Load the ice driver again (or let it load automatically) with modprobe ice
- The system crashes again during pcim_enable_device()

Fixes: 837f08fdecbe ("ice: Add basic driver framework for Intel(R) E800 Series")

Reported-by: Vishal Agrawal <vagrawal@redhat.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
---
v3: add Fixes tag as approximate, added Jay's RB tag
v2: respond to list comments and update commit message
v1: initial version
---
 drivers/net/ethernet/intel/ice/ice_main.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)


base-commit: 6a70e5cbedaf8ad10528ac9ac114f3ec20f422df

Comments

Pucha, HimasekharX Reddy Oct. 11, 2023, 6:16 a.m. UTC | #1
> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Jesse Brandeburg
> Sent: Wednesday, October 4, 2023 10:32 PM
> To: intel-wired-lan@lists.osuosl.org
> Cc: pmenzel@molgen.mpg.de; Vishal Agrawal <vagrawal@redhat.com>; linux-pci@vger.kernel.org; Brandeburg, Jesse <jesse.brandeburg@intel.com>; netdev@vger.kernel.org; jkc@redhat.com; Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>
> Subject: [Intel-wired-lan] [PATCH iwl-net v3] ice: reset first in crash dump kernels
>
> When the system boots into the crash dump kernel after a panic, the ice
> networking device may still have pending transactions that can cause errors
> or machine checks when the device is re-enabled. This can prevent the crash
> dump kernel from loading the driver or collecting the crash data.
>
> To avoid this issue, perform a function level reset (FLR) on the ice device
> via PCIe config space before enabling it on the crash kernel. This will
> clear any outstanding transactions and stop all queues and interrupts.
> Restore the config space after the FLR, otherwise it was found in testing
> that the driver wouldn't load successfully.
>
> The following sequence causes the original issue:
> - Load the ice driver with modprobe ice
> - Enable SR-IOV with 2 VFs: echo 2 > /sys/class/net/eth0/device/sriov_num_vfs
> - Trigger a crash with echo c > /proc/sysrq-trigger
> - Load the ice driver again (or let it load automatically) with modprobe ice
> - The system crashes again during pcim_enable_device()
>
> Fixes: 837f08fdecbe ("ice: Add basic driver framework for Intel(R) E800 Series")
>
> Reported-by: Vishal Agrawal <vagrawal@redhat.com>
> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> ---
> v3: add Fixes tag as approximate, added Jay's RB tag
> v2: respond to list comments and update commit message
> v1: initial version
> ---
>  drivers/net/ethernet/intel/ice/ice_main.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>

Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
diff mbox series

Patch

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index c8286adae946..6550c46e4e36 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -6,6 +6,7 @@ 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include <generated/utsrelease.h>
+#include <linux/crash_dump.h>
 #include "ice.h"
 #include "ice_base.h"
 #include "ice_lib.h"
@@ -5014,6 +5015,20 @@  ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
 		return -EINVAL;
 	}
 
+	/* when under a kdump kernel initiate a reset before enabling the
+	 * device in order to clear out any pending DMA transactions. These
+	 * transactions can cause some systems to machine check when doing
+	 * the pcim_enable_device() below.
+	 */
+	if (is_kdump_kernel()) {
+		pci_save_state(pdev);
+		pci_clear_master(pdev);
+		err = pcie_flr(pdev);
+		if (err)
+			return err;
+		pci_restore_state(pdev);
+	}
+
 	/* this driver uses devres, see
 	 * Documentation/driver-api/driver-model/devres.rst
 	 */