diff mbox series

[net,1/4] ibmvnic: Allow extra failures before disabling

Message ID 20220122025921.199446-1-sukadev@linux.ibm.com (mailing list archive)
State Accepted
Commit db9f0e8bf79e6da7068b5818fea0ffd9d0d4b4da
Delegated to: Netdev Maintainers
Headers show
Series [net,1/4] ibmvnic: Allow extra failures before disabling | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter warning Series does not have a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers fail 1 blamed authors not CCed: kuba@kernel.org; 7 maintainers not CCed: paulus@samba.org kuba@kernel.org benh@kernel.crashing.org mpe@ellerman.id.au linuxppc-dev@lists.ozlabs.org davem@davemloft.net tlfalcon@linux.ibm.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length of 83 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Sukadev Bhattiprolu Jan. 22, 2022, 2:59 a.m. UTC
If auto-priority-failover (APF) is enabled and there are at least two
backing devices of different priorities, some resets like fail-over,
change-param etc can cause at least two back to back failovers. (Failover
from high priority backing device to lower priority one and then back
to the higher priority one if that is still functional).

Depending on the timimg of the two failovers it is possible to trigger
a "hard" reset and for the hard reset to fail due to failovers. When this
occurs, the driver assumes that the network is unstable and disables the
VNIC for a 60-second "settling time". This in turn can cause the ethtool
command to fail with "No such device" while the vnic automatically recovers
a little while later.

Given that it's possible to have two back to back failures, allow for extra
failures before disabling the vnic for the settling time.

Fixes: f15fde9d47b8 ("ibmvnic: delay next reset if hard reset fails")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

Comments

Dany Madden Jan. 23, 2022, 12:22 a.m. UTC | #1
On 2022-01-21 18:59, Sukadev Bhattiprolu wrote:
> If auto-priority-failover (APF) is enabled and there are at least two
> backing devices of different priorities, some resets like fail-over,
> change-param etc can cause at least two back to back failovers. 
> (Failover
> from high priority backing device to lower priority one and then back
> to the higher priority one if that is still functional).
> 
> Depending on the timimg of the two failovers it is possible to trigger
> a "hard" reset and for the hard reset to fail due to failovers. When 
> this
> occurs, the driver assumes that the network is unstable and disables 
> the
> VNIC for a 60-second "settling time". This in turn can cause the 
> ethtool
> command to fail with "No such device" while the vnic automatically 
> recovers
> a little while later.
> 
> Given that it's possible to have two back to back failures, allow for 
> extra
> failures before disabling the vnic for the settling time.
> 
> Fixes: f15fde9d47b8 ("ibmvnic: delay next reset if hard reset fails")
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Reviewed-by: Dany Madden <drt@linux.ibm.com>

> ---
>  drivers/net/ethernet/ibm/ibmvnic.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c
> b/drivers/net/ethernet/ibm/ibmvnic.c
> index 0bb3911dd014..9b2d16ad76f1 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -2598,6 +2598,7 @@ static void __ibmvnic_reset(struct work_struct 
> *work)
>  	struct ibmvnic_rwi *rwi;
>  	unsigned long flags;
>  	u32 reset_state;
> +	int num_fails = 0;
>  	int rc = 0;
> 
>  	adapter = container_of(work, struct ibmvnic_adapter, ibmvnic_reset);
> @@ -2651,11 +2652,23 @@ static void __ibmvnic_reset(struct work_struct 
> *work)
>  				rc = do_hard_reset(adapter, rwi, reset_state);
>  				rtnl_unlock();
>  			}
> -			if (rc) {
> -				/* give backing device time to settle down */
> +			if (rc)
> +				num_fails++;
> +			else
> +				num_fails = 0;
> +
> +			/* If auto-priority-failover is enabled we can get
> +			 * back to back failovers during resets, resulting
> +			 * in at least two failed resets (from high-priority
> +			 * backing device to low-priority one and then back)
> +			 * If resets continue to fail beyond that, give the
> +			 * adapter some time to settle down before retrying.
> +			 */
> +			if (num_fails >= 3) {
>  				netdev_dbg(adapter->netdev,
> -					   "[S:%s] Hard reset failed, waiting 60 secs\n",
> -					   adapter_state_to_string(adapter->state));
> +					   "[S:%s] Hard reset failed %d times, waiting 60 secs\n",
> +					   adapter_state_to_string(adapter->state),
> +					   num_fails);
>  				set_current_state(TASK_UNINTERRUPTIBLE);
>  				schedule_timeout(60 * HZ);
>  			}
patchwork-bot+netdevbpf@kernel.org Jan. 24, 2022, 12:10 p.m. UTC | #2
Hello:

This series was applied to netdev/net.git (master)
by David S. Miller <davem@davemloft.net>:

On Fri, 21 Jan 2022 18:59:18 -0800 you wrote:
> If auto-priority-failover (APF) is enabled and there are at least two
> backing devices of different priorities, some resets like fail-over,
> change-param etc can cause at least two back to back failovers. (Failover
> from high priority backing device to lower priority one and then back
> to the higher priority one if that is still functional).
> 
> Depending on the timimg of the two failovers it is possible to trigger
> a "hard" reset and for the hard reset to fail due to failovers. When this
> occurs, the driver assumes that the network is unstable and disables the
> VNIC for a 60-second "settling time". This in turn can cause the ethtool
> command to fail with "No such device" while the vnic automatically recovers
> a little while later.
> 
> [...]

Here is the summary with links:
  - [net,1/4] ibmvnic: Allow extra failures before disabling
    https://git.kernel.org/netdev/net/c/db9f0e8bf79e
  - [net,2/4] ibmvnic: init ->running_cap_crqs early
    https://git.kernel.org/netdev/net/c/151b6a5c06b6
  - [net,3/4] ibmvnic: don't spin in tasklet
    https://git.kernel.org/netdev/net/c/48079e7fdd02
  - [net,4/4] ibmvnic: remove unused ->wait_capability
    https://git.kernel.org/netdev/net/c/3a5d9db7fbdf

You are awesome, thank you!
diff mbox series

Patch

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 0bb3911dd014..9b2d16ad76f1 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2598,6 +2598,7 @@  static void __ibmvnic_reset(struct work_struct *work)
 	struct ibmvnic_rwi *rwi;
 	unsigned long flags;
 	u32 reset_state;
+	int num_fails = 0;
 	int rc = 0;
 
 	adapter = container_of(work, struct ibmvnic_adapter, ibmvnic_reset);
@@ -2651,11 +2652,23 @@  static void __ibmvnic_reset(struct work_struct *work)
 				rc = do_hard_reset(adapter, rwi, reset_state);
 				rtnl_unlock();
 			}
-			if (rc) {
-				/* give backing device time to settle down */
+			if (rc)
+				num_fails++;
+			else
+				num_fails = 0;
+
+			/* If auto-priority-failover is enabled we can get
+			 * back to back failovers during resets, resulting
+			 * in at least two failed resets (from high-priority
+			 * backing device to low-priority one and then back)
+			 * If resets continue to fail beyond that, give the
+			 * adapter some time to settle down before retrying.
+			 */
+			if (num_fails >= 3) {
 				netdev_dbg(adapter->netdev,
-					   "[S:%s] Hard reset failed, waiting 60 secs\n",
-					   adapter_state_to_string(adapter->state));
+					   "[S:%s] Hard reset failed %d times, waiting 60 secs\n",
+					   adapter_state_to_string(adapter->state),
+					   num_fails);
 				set_current_state(TASK_UNINTERRUPTIBLE);
 				schedule_timeout(60 * HZ);
 			}