
[net] netdevsim: disable local BH when scheduling NAPI

Message ID: 20250212-netdevsim-v1-1-20ece94daae8@debian.org
State: Changes Requested
Delegated to: Netdev Maintainers

Checks

Context                         Check    Description
netdev/series_format            success  Single patches do not need cover letters
netdev/tree_selection           success  Clearly marked for net
netdev/ynl                      success  Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present            success  Fixes tag present in non-next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 0 this patch: 0
netdev/build_tools              success  No tools touched, skip
netdev/cc_maintainers           success  CCed 6 of 6 maintainers
netdev/build_clang              success  Errors and warnings before: 2 this patch: 2
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  Fixes tag looks correct
netdev/build_allmodconfig_warn  success  Errors and warnings before: 0 this patch: 0
netdev/checkpatch               success  total: 0 errors, 0 warnings, 0 checks, 9 lines checked
netdev/build_clang_rust         success  No Rust files in patch. Skipping build
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0
netdev/contest                  fail     net-next-2025-02-12--21-00 (tests: 888)

Commit Message

Breno Leitao Feb. 12, 2025, 6:34 p.m. UTC
The netdevsim driver was getting NOHZ tick-stop errors during packet
transmission due to pending softirq work when calling napi_schedule().

This is showing the following message when running netconsole selftest.

	NOHZ tick-stop error: local softirq work is pending, handler #08!!!
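
For readers unfamiliar with the warning: the number after "handler #" appears to be the pending softirq mask printed in hexadecimal, so 0x08 would be bit 3, which is NET_RX in the kernel's softirq numbering. The sketch below is a userspace decoder written for illustration only. decode_pending() and the name table are hypothetical helpers, not kernel code, and the bit order is assumed to match the softirq enum in include/linux/interrupt.h.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Softirq names in bit order (assumed to mirror the kernel's enum). */
static const char *const softirq_names[] = {
	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
	"IRQ_POLL", "TASKLET", "SCHED", "HRTIMER", "RCU",
};

#define NR_SOFTIRQ_NAMES (sizeof(softirq_names) / sizeof(softirq_names[0]))

/*
 * Hypothetical helper: write the names of all softirqs set in the
 * pending mask into buf, separated by '|'. Returns buf.
 */
static char *decode_pending(unsigned int pending, char *buf, size_t len)
{
	unsigned int i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < NR_SOFTIRQ_NAMES; i++) {
		size_t used = strlen(buf);

		if (!(pending & (1u << i)))
			continue;
		/* snprintf() always NUL-terminates; long output is truncated. */
		snprintf(buf + used, len - used, "%s%s",
			 used ? "|" : "", softirq_names[i]);
	}
	return buf;
}
```

Under these assumptions, decoding 0x08 yields "NET_RX", which is consistent with the softirq that napi_schedule() raises.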

Add local_bh_disable()/enable() around the napi_schedule() call to
prevent softirqs from being handled during this xmit.

Cc: stable@vger.kernel.org
Fixes: 3762ec05a9fb ("netdevsim: add NAPI support")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/net/netdevsim/netdev.c | 2 ++
 1 file changed, 2 insertions(+)


---
base-commit: cf33d96f50903214226b379b3f10d1f262dae018
change-id: 20250212-netdevsim-258d2d628175

Best regards,

Comments

Eric Dumazet Feb. 12, 2025, 6:55 p.m. UTC | #1
On Wed, Feb 12, 2025 at 7:34 PM Breno Leitao <leitao@debian.org> wrote:
>
> The netdevsim driver was getting NOHZ tick-stop errors during packet
> transmission due to pending softirq work when calling napi_schedule().
>
> This is showing the following message when running netconsole selftest.
>
>         NOHZ tick-stop error: local softirq work is pending, handler #08!!!
>
> Add local_bh_disable()/enable() around the napi_schedule() call to
> prevent softirqs from being handled during this xmit.
>
> Cc: stable@vger.kernel.org
> Fixes: 3762ec05a9fb ("netdevsim: add NAPI support")
> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  drivers/net/netdevsim/netdev.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
> index 42f247cbdceecbadf27f7090c030aa5bd240c18a..6aeb081b06da226ab91c49f53d08f465570877ae 100644
> --- a/drivers/net/netdevsim/netdev.c
> +++ b/drivers/net/netdevsim/netdev.c
> @@ -87,7 +87,9 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
>         if (unlikely(nsim_forward_skb(peer_dev, skb, rq) == NET_RX_DROP))
>                 goto out_drop_cnt;
>
> +       local_bh_disable();
>         napi_schedule(&rq->napi);
> +       local_bh_enable();
>

I thought all ndo_start_xmit() were done under local_bh_disable()

Could you give more details ?

Stanislav Fomichev Feb. 12, 2025, 10:05 p.m. UTC | #2
On 02/12, Eric Dumazet wrote:
> On Wed, Feb 12, 2025 at 7:34 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > The netdevsim driver was getting NOHZ tick-stop errors during packet
> > transmission due to pending softirq work when calling napi_schedule().
> >
> > This is showing the following message when running netconsole selftest.
> >
> >         NOHZ tick-stop error: local softirq work is pending, handler #08!!!
> >
> > Add local_bh_disable()/enable() around the napi_schedule() call to
> > prevent softirqs from being handled during this xmit.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: 3762ec05a9fb ("netdevsim: add NAPI support")
> > Suggested-by: Jakub Kicinski <kuba@kernel.org>
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> >  drivers/net/netdevsim/netdev.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
> > index 42f247cbdceecbadf27f7090c030aa5bd240c18a..6aeb081b06da226ab91c49f53d08f465570877ae 100644
> > --- a/drivers/net/netdevsim/netdev.c
> > +++ b/drivers/net/netdevsim/netdev.c
> > @@ -87,7 +87,9 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
> >         if (unlikely(nsim_forward_skb(peer_dev, skb, rq) == NET_RX_DROP))
> >                 goto out_drop_cnt;
> >
> > +       local_bh_disable();
> >         napi_schedule(&rq->napi);
> > +       local_bh_enable();
> >
> 
> I thought all ndo_start_xmit() were done under local_bh_disable()
> 
> Could you give more details ?

Not 100% sure this patch is the culprit, but looks related:

https://netdev-3.bots.linux.dev/vmksft-net-drv-dbg/results/989901/5-netcons-fragmented-msg-sh/stderr

---
pw-bot: cr

Breno Leitao Feb. 14, 2025, 1:01 p.m. UTC | #3
Hello Eric,

On Wed, Feb 12, 2025 at 07:55:32PM +0100, Eric Dumazet wrote:
> On Wed, Feb 12, 2025 at 7:34 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > --- a/drivers/net/netdevsim/netdev.c
> > +++ b/drivers/net/netdevsim/netdev.c
> > @@ -87,7 +87,9 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
> >         if (unlikely(nsim_forward_skb(peer_dev, skb, rq) == NET_RX_DROP))
> >                 goto out_drop_cnt;
> >
> > +       local_bh_disable();
> >         napi_schedule(&rq->napi);
> > +       local_bh_enable();
> >
> 
> I thought all ndo_start_xmit() were done under local_bh_disable()

I think it depends on the path?

> Could you give more details ?

There are several paths to ndo_start_xmit(); please correct me if
I am misreading the code here.

Common path:

	__dev_direct_xmit()
		local_bh_disable();
			netdev_start_xmit()
				__netdev_start_xmit()
					ops->ndo_start_xmit(skb, dev);


But, in some other cases, I see:

	netpoll_start_xmit()
		netdev_start_xmit()
			....

My reading is that not every path calls local_bh_disable() before
calling ndo_start_xmit().

Question: Must BH be disabled before calling ndo_start_xmit()? If so,
the problem might be in the netpoll code!? Also, is it worth adding
a DEBUG_NET_WARN_ON_ONCE()?
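
To make the two paths above concrete, here is a small userspace toy model, not kernel code. Every model_* name is made up for illustration. It assumes that a softirq raised while BH is enabled simply stays pending, that leaving the outermost BH-disabled section runs pending softirqs, and it ignores hard-IRQ state, ksoftirqd and interrupt exit entirely. The warnings counter marks where a debug check such as the one asked about could fire.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NET_RX_BIT 3u	/* assumed bit position of NET_RX */

/* Toy per-CPU state: BH-disable nesting depth and pending softirq mask. */
struct cpu_model {
	unsigned int bh_disable_depth;
	unsigned int pending;
	unsigned int handled;	/* times the NET_RX handler ran */
	unsigned int warnings;	/* hits of the hypothetical debug check */
};

static void model_run_softirqs(struct cpu_model *cpu)
{
	if (cpu->pending & (1u << MODEL_NET_RX_BIT)) {
		cpu->pending &= ~(1u << MODEL_NET_RX_BIT);
		cpu->handled++;
	}
}

static void model_bh_disable(struct cpu_model *cpu)
{
	cpu->bh_disable_depth++;
}

/* Leaving the outermost BH-disabled section runs whatever is pending. */
static void model_bh_enable(struct cpu_model *cpu)
{
	assert(cpu->bh_disable_depth > 0);
	if (--cpu->bh_disable_depth == 0)
		model_run_softirqs(cpu);
}

/* Stand-in for napi_schedule(): only marks NET_RX as pending. */
static void model_napi_schedule(struct cpu_model *cpu)
{
	cpu->pending |= 1u << MODEL_NET_RX_BIT;
}

/* Stand-in for ndo_start_xmit(); wrap_schedule mimics this patch. */
static void model_start_xmit(struct cpu_model *cpu, bool wrap_schedule)
{
	if (cpu->bh_disable_depth == 0)
		cpu->warnings++;	/* a debug check would fire here */

	if (wrap_schedule)
		model_bh_disable(cpu);
	model_napi_schedule(cpu);
	if (wrap_schedule)
		model_bh_enable(cpu);
}

/* Caller disables BH first, like the __dev_direct_xmit() trace. */
static void model_direct_xmit(struct cpu_model *cpu, bool wrap_schedule)
{
	model_bh_disable(cpu);
	model_start_xmit(cpu, wrap_schedule);
	model_bh_enable(cpu);
}

/* Caller does not disable BH, like the netpoll trace. */
static void model_netpoll_xmit(struct cpu_model *cpu, bool wrap_schedule)
{
	model_start_xmit(cpu, wrap_schedule);
}
```

In this model, the direct path leaves nothing pending on return, while the netpoll-like path returns with NET_RX still pending unless the schedule call is wrapped. Whether the real netpoll path may legally re-enable BH at that point is a separate question that the model does not answer.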

Note: Jakub suggested another way to fix this, so v2 uses
a different approach:

	https://lore.kernel.org/all/20250213071426.01490615@kernel.org/

Thanks for the review!
--breno

Patch

diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 42f247cbdceecbadf27f7090c030aa5bd240c18a..6aeb081b06da226ab91c49f53d08f465570877ae 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -87,7 +87,9 @@  static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(nsim_forward_skb(peer_dev, skb, rq) == NET_RX_DROP))
 		goto out_drop_cnt;
 
+	local_bh_disable();
 	napi_schedule(&rq->napi);
+	local_bh_enable();
 
 	rcu_read_unlock();
 	u64_stats_update_begin(&ns->syncp);