Message ID | 20250212174329.53793-2-frederic@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | net: Fix/prevent napi_schedule() call from bare task context | expand |
On Wed, 12 Feb 2025 18:43:28 +0100 Frederic Weisbecker wrote: > napi_schedule() is expected to be called either: > > * From an interrupt, where raised softirqs are handled on IRQ exit > > * From a softirq disabled section, where raised softirqs are handled on > the next call to local_bh_enable(). > > * From a softirq handler, where raised softirqs are handled on the next > round in do_softirq(), or further deferred to a dedicated kthread. > > Other bare tasks context may end up ignoring the raised NET_RX vector > until the next random softirq handling opportunity, which may not > happen before a while if the CPU goes idle afterwards with the tick > stopped. > > Report inappropriate calling contexts when neither of the three above > conditions are met. Looks like netcons is hitting this warning in netdevsim: [ 16.063196][ T219] nsim_start_xmit+0x4e0/0x6f0 [netdevsim] [ 16.063219][ T219] ? netif_skb_features+0x23e/0xa80 [ 16.063237][ T219] netpoll_start_xmit+0x3c3/0x670 [ 16.063258][ T219] __netpoll_send_skb+0x3e9/0x800 [ 16.063287][ T219] netpoll_send_skb+0x2a/0xa0 [ 16.063298][ T219] send_ext_msg_udp+0x286/0x350 [netconsole] [ 16.063325][ T219] write_ext_msg+0x1c6/0x230 [netconsole] [ 16.063346][ T219] console_emit_next_record+0x20d/0x430 https://netdev-3.bots.linux.dev/vmksft-net-drv-dbg/results/990261/7-netcons-basic-sh/stderr We gotta fix that first. Please post the fixes for net, and then the warning in net-next. So that we have some time to fix the uncovered warnings before users are broadly exposed to them.
On Wed, Feb 12, 2025 at 07:48:20PM -0800, Jakub Kicinski wrote: > On Wed, 12 Feb 2025 18:43:28 +0100 Frederic Weisbecker wrote: > > napi_schedule() is expected to be called either: > > > > * From an interrupt, where raised softirqs are handled on IRQ exit > > > > * From a softirq disabled section, where raised softirqs are handled on > > the next call to local_bh_enable(). > > > > * From a softirq handler, where raised softirqs are handled on the next > > round in do_softirq(), or further deferred to a dedicated kthread. > > > > Other bare tasks context may end up ignoring the raised NET_RX vector > > until the next random softirq handling opportunity, which may not > > happen before a while if the CPU goes idle afterwards with the tick > > stopped. > > > > Report inappropriate calling contexts when neither of the three above > > conditions are met. > > Looks like netcons is hitting this warning in netdevsim: > > [ 16.063196][ T219] nsim_start_xmit+0x4e0/0x6f0 [netdevsim] > [ 16.063219][ T219] ? netif_skb_features+0x23e/0xa80 > [ 16.063237][ T219] netpoll_start_xmit+0x3c3/0x670 > [ 16.063258][ T219] __netpoll_send_skb+0x3e9/0x800 > [ 16.063287][ T219] netpoll_send_skb+0x2a/0xa0 > [ 16.063298][ T219] send_ext_msg_udp+0x286/0x350 [netconsole] > [ 16.063325][ T219] write_ext_msg+0x1c6/0x230 [netconsole] > [ 16.063346][ T219] console_emit_next_record+0x20d/0x430 > > https://netdev-3.bots.linux.dev/vmksft-net-drv-dbg/results/990261/7-netcons-basic-sh/stderr > > We gotta fix that first. Thanks Jakub, I understand that it will be fixed by this patchset, right? https://lore.kernel.org/all/20250212-netdevsim-v1-1-20ece94daae8@debian.org/
On Thu, 13 Feb 2025 01:58:17 -0800 Breno Leitao wrote: > > Looks like netcons is hitting this warning in netdevsim: > > > > [ 16.063196][ T219] nsim_start_xmit+0x4e0/0x6f0 [netdevsim] > > [ 16.063219][ T219] ? netif_skb_features+0x23e/0xa80 > > [ 16.063237][ T219] netpoll_start_xmit+0x3c3/0x670 > > [ 16.063258][ T219] __netpoll_send_skb+0x3e9/0x800 > > [ 16.063287][ T219] netpoll_send_skb+0x2a/0xa0 > > [ 16.063298][ T219] send_ext_msg_udp+0x286/0x350 [netconsole] > > [ 16.063325][ T219] write_ext_msg+0x1c6/0x230 [netconsole] > > [ 16.063346][ T219] console_emit_next_record+0x20d/0x430 > > > > https://netdev-3.bots.linux.dev/vmksft-net-drv-dbg/results/990261/7-netcons-basic-sh/stderr > > > > We gotta fix that first. > > Thanks Jakub, > > I understand that it will be fixed by this patchset, right? The problem is a bit nasty, on a closer look. We don't know if netcons is called in IRQ context or not. How about we add an hrtimer to netdevsim, schedule it to fire 5usec in the future instead of scheduling NAPI immediately? We can call napi_schedule() from a timer safely. Unless there's another driver which schedules NAPI from xmit. Then we'd need to try harder to fix this in netpoll. veth does use NAPI on xmit but it sets IFF_DISABLE_NETPOLL already.
Hello Jakub, On Thu, Feb 13, 2025 at 07:14:26AM -0800, Jakub Kicinski wrote: > On Thu, 13 Feb 2025 01:58:17 -0800 Breno Leitao wrote: > > > Looks like netcons is hitting this warning in netdevsim: > > > > > > [ 16.063196][ T219] nsim_start_xmit+0x4e0/0x6f0 [netdevsim] > > > [ 16.063219][ T219] ? netif_skb_features+0x23e/0xa80 > > > [ 16.063237][ T219] netpoll_start_xmit+0x3c3/0x670 > > > [ 16.063258][ T219] __netpoll_send_skb+0x3e9/0x800 > > > [ 16.063287][ T219] netpoll_send_skb+0x2a/0xa0 > > > [ 16.063298][ T219] send_ext_msg_udp+0x286/0x350 [netconsole] > > > [ 16.063325][ T219] write_ext_msg+0x1c6/0x230 [netconsole] > > > [ 16.063346][ T219] console_emit_next_record+0x20d/0x430 > > > > > > https://netdev-3.bots.linux.dev/vmksft-net-drv-dbg/results/990261/7-netcons-basic-sh/stderr > > > > > > We gotta fix that first. > > > > Thanks Jakub, > > > > I understand that it will be fixed by this patchset, right? > > The problem is a bit nasty, on a closer look. We don't know if netcons > is called in IRQ context or not. How about we add an hrtimer to netdevsim, > schedule it to fire 5usec in the future instead of scheduling NAPI > immediately? We can call napi_schedule() from a timer safely. > > Unless there's another driver which schedules NAPI from xmit. > Then we'd need to try harder to fix this in netpoll. > veth does use NAPI on xmit but it sets IFF_DISABLE_NETPOLL already. Just to make sure I follow the netpoll issue. What would you like to fix in netpoll exactly?
On Thu, 13 Feb 2025 10:14:02 -0800 Breno Leitao wrote: > > The problem is a bit nasty, on a closer look. We don't know if netcons > > is called in IRQ context or not. How about we add an hrtimer to netdevsim, > > schedule it to fire 5usec in the future instead of scheduling NAPI > > immediately? We can call napi_schedule() from a timer safely. > > > > Unless there's another driver which schedules NAPI from xmit. > > Then we'd need to try harder to fix this in netpoll. > > veth does use NAPI on xmit but it sets IFF_DISABLE_NETPOLL already. > > Just to make sure I follow the netpoll issue. What would you like to fix > in netpoll exactly? Nothing in netpoll, the problem is that netdevsim calls napi_schedule from the xmit path. That's incompatible with netpoll. We should fix netdevsim instead (unless more real drivers need napi-from-xmit to work).
Le Thu, Feb 13, 2025 at 11:04:52AM -0800, Jakub Kicinski a écrit : > On Thu, 13 Feb 2025 10:14:02 -0800 Breno Leitao wrote: > > > The problem is a bit nasty, on a closer look. We don't know if netcons > > > is called in IRQ context or not. How about we add an hrtimer to netdevsim, > > > schedule it to fire 5usec in the future instead of scheduling NAPI > > > immediately? We can call napi_schedule() from a timer safely. > > > > > > Unless there's another driver which schedules NAPI from xmit. > > > Then we'd need to try harder to fix this in netpoll. > > > veth does use NAPI on xmit but it sets IFF_DISABLE_NETPOLL already. > > > > Just to make sure I follow the netpoll issue. What would you like to fix > > in netpoll exactly? > > Nothing in netpoll, the problem is that netdevsim calls napi_schedule > from the xmit path. That's incompatible with netpoll. We should fix > netdevsim instead (unless more real drivers need napi-from-xmit to > work). Let me clarify, because I don't know much this area. If the problem is that xmit can't call napi_schedule() by design, then I defer to you. But if the problem is that napi_schedule() may or may not be called from an interrupt, please note that local_bh_enable() won't run softirqs from a hardirq and will instead defer to IRQ tail. So it's fine to do an unconditional pair of local_bh_disable() / local_bh_enable(). Thanks.
Hi Jakub, Le Wed, Feb 12, 2025 at 07:48:20PM -0800, Jakub Kicinski a écrit : > On Wed, 12 Feb 2025 18:43:28 +0100 Frederic Weisbecker wrote: > Please post the fixes for net, and then the warning in net-next. > So that we have some time to fix the uncovered warnings before > users are broadly exposed to them. How do I define the patch target? Is it just about patch prefix? Thanks. > -- > pw-bot: cr
On Thu, Feb 13, 2025 at 11:04:52AM -0800, Jakub Kicinski wrote: > On Thu, 13 Feb 2025 10:14:02 -0800 Breno Leitao wrote: > > > The problem is a bit nasty, on a closer look. We don't know if netcons > > > is called in IRQ context or not. How about we add an hrtimer to netdevsim, > > > schedule it to fire 5usec in the future instead of scheduling NAPI > > > immediately? We can call napi_schedule() from a timer safely. > > > > > > Unless there's another driver which schedules NAPI from xmit. > > > Then we'd need to try harder to fix this in netpoll. > > > veth does use NAPI on xmit but it sets IFF_DISABLE_NETPOLL already. > > > > Just to make sure I follow the netpoll issue. What would you like to fix > > in netpoll exactly? > > Nothing in netpoll, the problem is that netdevsim calls napi_schedule Hm, you said the following above: Then we'd need to try harder to fix this in netpoll. I was curious about the meaning of that statement? Thanks --breno
Hello Jakub, On Thu, Feb 13, 2025 at 07:14:26AM -0800, Jakub Kicinski wrote: > ... How about we add an hrtimer to netdevsim, > schedule it to fire 5usec in the future instead of scheduling NAPI > immediately? We can call napi_schedule() from a timer safely. I hacked a way to do so. Is this what you had in mind? Author: Breno Leitao <leitao@debian.org> Date: Wed Feb 12 09:50:51 2025 -0800 netdevsim: call napi_schedule from a timer context The netdevsim driver was experiencing NOHZ tick-stop errors during packet transmission due to pending softirq work when calling napi_schedule(). This issue was observed when running the netconsole selftest, which triggered the following error message: NOHZ tick-stop error: local softirq work is pending, handler #08!!! To fix this issue, introduce a timer that schedules napi_schedule() from a timer context instead of calling it directly from the TX path. Create an hrtimer for each queue and kick it from the TX path, which then schedules napi_schedule() from the timer context. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c index 42f247cbdceec..cd56904a39049 100644 --- a/drivers/net/netdevsim/netdev.c +++ b/drivers/net/netdevsim/netdev.c @@ -87,7 +87,7 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev) if (unlikely(nsim_forward_skb(peer_dev, skb, rq) == NET_RX_DROP)) goto out_drop_cnt; - napi_schedule(&rq->napi); + hrtimer_start(&rq->napi_timer, ns_to_ktime(5), HRTIMER_MODE_REL); rcu_read_unlock(); u64_stats_update_begin(&ns->syncp); @@ -426,6 +426,25 @@ static int nsim_init_napi(struct netdevsim *ns) return err; } +static enum hrtimer_restart nsim_napi_schedule(struct hrtimer *timer) +{ + struct nsim_rq *rq; + + rq = container_of(timer, struct nsim_rq, napi_timer); + napi_schedule(&rq->napi); + /* TODO: Should HRTIMER_RESTART be returned if napi_schedule returns + * false? + */ + + return HRTIMER_NORESTART; +} + +static void nsim_rq_timer_init(struct nsim_rq *rq) +{ + hrtimer_init(&rq->napi_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); + rq->napi_timer.function = nsim_napi_schedule; +} + static void nsim_enable_napi(struct netdevsim *ns) { struct net_device *dev = ns->netdev; @@ -436,6 +455,7 @@ static void nsim_enable_napi(struct netdevsim *ns) netif_queue_set_napi(dev, i, NETDEV_QUEUE_TYPE_RX, &rq->napi); napi_enable(&rq->napi); + nsim_rq_timer_init(rq); } } @@ -461,6 +481,7 @@ static void nsim_del_napi(struct netdevsim *ns) for (i = 0; i < dev->num_rx_queues; i++) { struct nsim_rq *rq = ns->rq[i]; + hrtimer_cancel(&rq->napi_timer); napi_disable(&rq->napi); __netif_napi_del(&rq->napi); } diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h index dcf073bc4802e..2b396c517ac1d 100644 --- a/drivers/net/netdevsim/netdevsim.h +++ b/drivers/net/netdevsim/netdevsim.h @@ -97,6 +97,7 @@ struct nsim_rq { struct napi_struct napi; struct sk_buff_head skb_queue; struct page_pool *page_pool; + struct hrtimer napi_timer; }; struct netdevsim {
On Thu, 13 Feb 2025 21:38:12 +0100 Frederic Weisbecker wrote: > > > Just to make sure I follow the netpoll issue. What would you like to fix > > > in netpoll exactly? > > > > Nothing in netpoll, the problem is that netdevsim calls napi_schedule > > from the xmit path. That's incompatible with netpoll. We should fix > > netdevsim instead (unless more real drivers need napi-from-xmit to > > work). > > Let me clarify, because I don't know much this area. If the problem is that xmit > can't call napi_schedule() by design, then I defer to you. But if the problem is that > napi_schedule() may or may not be called from an interrupt, please note that > local_bh_enable() won't run softirqs from a hardirq and will instead defer to > IRQ tail. So it's fine to do an unconditional pair of local_bh_disable() / local_bh_enable(). I don't know where this is in the code TBH, but my understanding is that HW IRQs - yes, as you say it'd be safe; the problem is that we have local_irq_save() all over the place. And that is neither protected from local_bh_enable(), not does irq_restore execute softirqs.
On Thu, 13 Feb 2025 21:39:51 +0100 Frederic Weisbecker wrote: > Le Wed, Feb 12, 2025 at 07:48:20PM -0800, Jakub Kicinski a écrit : > > On Wed, 12 Feb 2025 18:43:28 +0100 Frederic Weisbecker wrote: > > Please post the fixes for net, and then the warning in net-next. > > So that we have some time to fix the uncovered warnings before > > users are broadly exposed to them. > > How do I define the patch target? Is it just about patch prefix? Yes, --subject-prefix="PATCH net" for fixes --subject-prefix="PATCH net-next" for -next stuff But even that isn't really a super hard requirement. My main ask was to post patch 2, then wait for it to propagate (we merge the trees every Thu afternoon) then post patch 1. Sorry for the delay.
On Fri, 14 Feb 2025 08:43:28 -0800 Breno Leitao wrote: > diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c > index 42f247cbdceec..cd56904a39049 100644 > --- a/drivers/net/netdevsim/netdev.c > +++ b/drivers/net/netdevsim/netdev.c > @@ -87,7 +87,7 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev) > if (unlikely(nsim_forward_skb(peer_dev, skb, rq) == NET_RX_DROP)) > goto out_drop_cnt; > > - napi_schedule(&rq->napi); > + hrtimer_start(&rq->napi_timer, ns_to_ktime(5), HRTIMER_MODE_REL); ns -> us We want to leave the timer be in case it's already scheduled. Otherwise we'll keep postponing forever under load. Double check that hrtime_start() does not reset the time if already pending. Maybe hrtimer_start_range_ns(..., 0, us_to_ktime(5), ...) would work? > rcu_read_unlock(); > u64_stats_update_begin(&ns->syncp); > @@ -426,6 +426,25 @@ static int nsim_init_napi(struct netdevsim *ns) > return err; > } > > +static enum hrtimer_restart nsim_napi_schedule(struct hrtimer *timer) > +{ > + struct nsim_rq *rq; > + > + rq = container_of(timer, struct nsim_rq, napi_timer); > + napi_schedule(&rq->napi); > + /* TODO: Should HRTIMER_RESTART be returned if napi_schedule returns > + * false? > + */ I think not, ignore the return value > + return HRTIMER_NORESTART; > +}
Le Fri, Feb 14, 2025 at 02:00:45PM -0800, Jakub Kicinski a écrit : > On Thu, 13 Feb 2025 21:38:12 +0100 Frederic Weisbecker wrote: > > > > Just to make sure I follow the netpoll issue. What would you like to fix > > > > in netpoll exactly? > > > > > > Nothing in netpoll, the problem is that netdevsim calls napi_schedule > > > from the xmit path. That's incompatible with netpoll. We should fix > > > netdevsim instead (unless more real drivers need napi-from-xmit to > > > work). > > > > Let me clarify, because I don't know much this area. If the problem is that xmit > > can't call napi_schedule() by design, then I defer to you. But if the problem is that > > napi_schedule() may or may not be called from an interrupt, please note that > > local_bh_enable() won't run softirqs from a hardirq and will instead defer to > > IRQ tail. So it's fine to do an unconditional pair of local_bh_disable() / local_bh_enable(). > > I don't know where this is in the code TBH, but my understanding is > that HW IRQs - yes, as you say it'd be safe; the problem is that > we have local_irq_save() all over the place. And that is neither > protected from local_bh_enable(), not does irq_restore execute softirqs. Yeah actually checking local_bh_enable() again, it's not safe to call within a hardirq. Ok I've been thinking some more and how about this instead? diff --git a/net/core/dev.c b/net/core/dev.c index c0021cbd28fc..2419cc558a64 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4692,7 +4692,7 @@ static inline void ____napi_schedule(struct softnet_data *sd, * we have to raise NET_RX_SOFTIRQ. */ if (!sd->in_net_rx_action) - __raise_softirq_irqoff(NET_RX_SOFTIRQ); + raise_softirq_irqoff(NET_RX_SOFTIRQ); } #ifdef CONFIG_RPS This will simply wake up ksoftirqd if called from a non-IRQ. I expect such callers to be rare enough to not impact performances and it has the advantage to work for everyone. Thanks.
On Fri, Feb 14, 2025 at 02:10:11PM -0800, Jakub Kicinski wrote: > On Fri, 14 Feb 2025 08:43:28 -0800 Breno Leitao wrote: > > diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c > > index 42f247cbdceec..cd56904a39049 100644 > > --- a/drivers/net/netdevsim/netdev.c > > +++ b/drivers/net/netdevsim/netdev.c > > @@ -87,7 +87,7 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev) > > if (unlikely(nsim_forward_skb(peer_dev, skb, rq) == NET_RX_DROP)) > > goto out_drop_cnt; > > > > - napi_schedule(&rq->napi); > > + hrtimer_start(&rq->napi_timer, ns_to_ktime(5), HRTIMER_MODE_REL); > > ns -> us > > We want to leave the timer be in case it's already scheduled. > Otherwise we'll keep postponing forever under load. > Double check that hrtime_start() does not reset the time if already > pending. Maybe hrtimer_start_range_ns(..., 0, us_to_ktime(5), ...) > would work? Reading the code, I got the impression that hrtimer_start_range_ns(..., 0, us_to_ktime(5), ...) will reprogram the timer anyway. hrtimer_start_range_ns() { __hrtimer_start_range_ns() { remove_hrtimer(timer, base, true, force_local); enqueue_hrtimer(timer, new_base, mode); ... } } I think a better solution is to do something as: if (!hrtimer_active(&rq->napi_timer)) hrtimer_start(&rq->napi_timer, us_to_ktime(5), HRTIMER_MODE_REL);
On Mon, 17 Feb 2025 08:46:21 -0800 Breno Leitao wrote: > I think a better solution is to do something as: > > if (!hrtimer_active(&rq->napi_timer)) > hrtimer_start(&rq->napi_timer, us_to_ktime(5), HRTIMER_MODE_REL); sounds good
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 67964dc4db95..1bd730b881f0 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -619,6 +619,17 @@ do { \ (!in_softirq() || in_irq() || in_nmi())); \ } while (0) +/* + * Assert to be either in hardirq or in serving softirq or with + * softirqs disabled. Verifies a safe context to queue a softirq + * with __raise_softirq_irqoff(). + */ +#define lockdep_assert_in_interrupt() \ +do { \ + WARN_ON_ONCE(__lockdep_enabled && !in_interrupt()); \ +} while (0) + + extern void lockdep_assert_in_softirq_func(void); #else @@ -634,6 +645,7 @@ extern void lockdep_assert_in_softirq_func(void); # define lockdep_assert_preemption_enabled() do { } while (0) # define lockdep_assert_preemption_disabled() do { } while (0) # define lockdep_assert_in_softirq() do { } while (0) +# define lockdep_assert_in_interrupt() do { } while (0) # define lockdep_assert_in_softirq_func() do { } while (0) #endif diff --git a/net/core/dev.c b/net/core/dev.c index c0021cbd28fc..80e415ccf2c8 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4666,6 +4666,7 @@ static inline void ____napi_schedule(struct softnet_data *sd, struct task_struct *thread; lockdep_assert_irqs_disabled(); + lockdep_assert_in_interrupt(); if (test_bit(NAPI_STATE_THREADED, &napi->state)) { /* Paired with smp_mb__before_atomic() in
napi_schedule() is expected to be called either: * From an interrupt, where raised softirqs are handled on IRQ exit * From a softirq disabled section, where raised softirqs are handled on the next call to local_bh_enable(). * From a softirq handler, where raised softirqs are handled on the next round in do_softirq(), or further deferred to a dedicated kthread. Other bare tasks context may end up ignoring the raised NET_RX vector until the next random softirq handling opportunity, which may not happen before a while if the CPU goes idle afterwards with the tick stopped. Report inappropriate calling contexts when neither of the three above conditions are met. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> --- include/linux/lockdep.h | 12 ++++++++++++ net/core/dev.c | 1 + 2 files changed, 13 insertions(+)