Message ID | 20221105071028.578594-6-saeed@kernel.org (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | [V2,net,01/11] net/mlx5: Bridge, verify LAG state when adding bond to bridge |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Clearly marked for net |
netdev/fixes_present | success | Fixes tag present in non-next series |
netdev/subject_prefix | success | Link |
netdev/cover_letter | success | Pull request is its own cover letter |
netdev/patch_count | success | Link |
netdev/header_inline | success | No static functions without inline keyword in header files |
netdev/build_32bit | success | Errors and warnings before: 0 this patch: 0 |
netdev/cc_maintainers | success | CCed 5 of 5 maintainers |
netdev/build_clang | success | Errors and warnings before: 0 this patch: 0 |
netdev/module_param | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Signed-off-by tag matches author and committer |
netdev/check_selftest | success | No net selftest shell script |
netdev/verify_fixes | success | Fixes tag looks correct |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 0 this patch: 0 |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 30 lines checked |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/source_inline | success | Was 0 now: 0 |
On Sat, 5 Nov 2022 00:10:22 -0700 Saeed Mahameed wrote:
> +	/* Once deactivated, new tx_timeout_work won't be initiated. */
> +	if (current_work() != &priv->tx_timeout_work)
> +		cancel_work_sync(&priv->tx_timeout_work);

The work takes rtnl_lock, are there no callers of
mlx5e_switch_priv_channels() that are under rtnl_lock()?

This patch is definitely going onto my "expecting Fixes"
bingo card :S
On Mon, 2022-11-07 at 20:24 -0800, Jakub Kicinski wrote:
> On Sat, 5 Nov 2022 00:10:22 -0700 Saeed Mahameed wrote:
> > +	/* Once deactivated, new tx_timeout_work won't be initiated. */
> > +	if (current_work() != &priv->tx_timeout_work)
> > +		cancel_work_sync(&priv->tx_timeout_work);
>
> The work takes rtnl_lock, are there no callers of
> mlx5e_switch_priv_channels() that are under rtnl_lock()?
>
> This patch is definitely going onto my "expecting Fixes"
> bingo card :S

I think Jakub is right: even mlx5e_close_locked() will deadlock on
cancel_work_sync() if the work is scheduled but has not yet acquired
the rtnl lock.

IIRC lockdep is not able to catch this kind of situation, so you can
only observe the deadlock when reaching the critical scenario.

I'm wild guessing that a possible solution would be to restrict the
state_lock scope in mlx5e_tx_timeout_work() to just the state check,
without additional cancel_work operations.

Thanks,

Paolo
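To illustrate the concern (a simplified sketch, not the actual mlx5 call chain): the affected paths run under rtnl, so a cancel_work_sync() there waits on a work item that itself waits on rtnl:

```c
/* Simplified sketch of the deadlock (kernel-style pseudocode only) */

/* Thread A: e.g. mlx5e_switch_priv_channels(), entered under rtnl */
rtnl_lock();                               /* taken by the caller    */
mlx5e_deactivate_priv_channels(priv);
cancel_work_sync(&priv->tx_timeout_work);  /* waits for the work to
                                            * finish executing       */

/* Thread B: tx_timeout_work, already queued and running */
mlx5e_tx_timeout_work(work);
    rtnl_lock();                           /* never succeeds: thread A
                                            * holds rtnl              */
```

Thread A holds rtnl and waits for the work to complete; the work waits for rtnl. Neither can make progress, so the flush-vs-lock cycle hangs both threads.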
On 08 Nov 11:19, Paolo Abeni wrote:
>On Mon, 2022-11-07 at 20:24 -0800, Jakub Kicinski wrote:
>> On Sat, 5 Nov 2022 00:10:22 -0700 Saeed Mahameed wrote:
>> > +	/* Once deactivated, new tx_timeout_work won't be initiated. */
>> > +	if (current_work() != &priv->tx_timeout_work)
>> > +		cancel_work_sync(&priv->tx_timeout_work);
>>
>> The work takes rtnl_lock, are there no callers of
>> mlx5e_switch_priv_channels() that are under rtnl_lock()?
>>
>> This patch is definitely going onto my "expecting Fixes"
>> bingo card :S
>
>I think Jakub is right and even mlx5e_close_locked() will deadlock on
>cancel_work_sync() if the work is scheduled but it has not yet acquired
>the rtnl lock.

Yes, you are absolutely correct; you can see the deadlock just by
looking at the patch diff and applying common sense, since
mlx5e_switch_priv_channels() is being called under rtnl.

>IIRC lockdep is not able to catch this kind of situation, so you can
>only observe the deadlock when reaching the critical scenario.
>
>I'm wild guessing than a possible solution would be restrict the
>state_lock scope in mlx5e_tx_timeout_work() around the state check,
>without additional cancel_work operations.

Thanks, I will drop the patch for now and send v3 without it.

>Thanks,
>
>Paolo
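For reference, Paolo's suggested direction might look roughly like the sketch below (hypothetical only; the eventual v3 may do something different). The idea is to shrink the state_lock critical section in mlx5e_tx_timeout_work() to just the state check, so the close/switch paths no longer need any cancel_work_sync() at all:

```c
/* Hypothetical sketch of Paolo's suggestion -- not a tested patch */
static void mlx5e_tx_timeout_work(struct work_struct *work)
{
	struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv,
					       tx_timeout_work);
	bool opened;

	rtnl_lock();

	/* Hold state_lock only around the state check: once the channels
	 * are deactivated the work bails out here on its own, so the
	 * close path does not have to flush it (and cannot deadlock
	 * against it).
	 */
	mutex_lock(&priv->state_lock);
	opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
	mutex_unlock(&priv->state_lock);
	if (!opened)
		goto unlock;

	/* ... recover the timed-out SQs as before ... */

unlock:
	rtnl_unlock();
}
```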
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 364f04309149..09296f1ae448 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2954,6 +2954,9 @@ static int mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
 	netif_carrier_off(netdev);
 	mlx5e_deactivate_priv_channels(priv);
 
+	/* Once deactivated, new tx_timeout_work won't be initiated. */
+	if (current_work() != &priv->tx_timeout_work)
+		cancel_work_sync(&priv->tx_timeout_work);
 	old_chs = priv->channels;
 	priv->channels = *new_chs;
 
@@ -3106,6 +3109,7 @@ int mlx5e_close_locked(struct net_device *netdev)
 	netif_carrier_off(priv->netdev);
 	mlx5e_deactivate_priv_channels(priv);
+	cancel_work_sync(&priv->tx_timeout_work);
 	mlx5e_close_channels(&priv->channels);
 
 	return 0;
@@ -4657,7 +4661,6 @@ static void mlx5e_tx_timeout_work(struct work_struct *work)
 	int i;
 
 	rtnl_lock();
-	mutex_lock(&priv->state_lock);
 
 	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
 		goto unlock;
@@ -4676,7 +4679,6 @@ static void mlx5e_tx_timeout_work(struct work_struct *work)
 	}
 
 unlock:
-	mutex_unlock(&priv->state_lock);
 	rtnl_unlock();
 }