
[v2] net: raise RCU qs after each threaded NAPI poll

Message ID ZeFPz4D121TgvCje@debian.debian (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [v2] net: raise RCU qs after each threaded NAPI poll | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 947 this patch: 947
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/build_clang success Errors and warnings before: 957 this patch: 957
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 963 this patch: 963
netdev/checkpatch warning WARNING: Commit log lines starting with '#' are dropped by git as comments
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-02--12-00 (tests: 884)

Commit Message

Yan Zhai March 1, 2024, 3:47 a.m. UTC
We noticed the tasks RCU being blocked when threaded NAPIs are very busy with
certain workloads: detaching any BPF tracing program, i.e. removing an ftrace
trampoline, will simply block for a very long time in rcu_tasks_wait_gp. This
ranges from hundreds of seconds to even an hour, severely harming any
observability tools that rely on BPF tracing programs. It can be easily
reproduced locally with the following setup:

ip netns add test1
ip netns add test2

ip -n test1 link add veth1 type veth peer name veth2 netns test2

ip -n test1 link set veth1 up
ip -n test1 link set lo up
ip -n test2 link set veth2 up
ip -n test2 link set lo up

ip -n test1 addr add 192.168.1.2/31 dev veth1
ip -n test1 addr add 1.1.1.1/32 dev lo
ip -n test2 addr add 192.168.1.3/31 dev veth2
ip -n test2 addr add 2.2.2.2/31 dev lo

ip -n test1 route add default via 192.168.1.3
ip -n test2 route add default via 192.168.1.2

for i in `seq 10 210`; do
 for j in `seq 10 210`; do
    ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
 done
done

ip netns exec test2 ethtool -K veth2 gro on
ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
ip netns exec test1 ethtool -K veth1 tso off

Then run an iperf3 client/server and a bpftrace script can trigger it:

ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'

The above reproduces on a net-next kernel with the following RCU and preempt
configurations:

# RCU Subsystem
CONFIG_TREE_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem
# RCU Debugging
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

CONFIG_PREEMPT_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_RCU=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

An interesting observation is that, while the tasks RCU is blocked, the
related NAPI thread is still being scheduled (even across cores) regularly.
Looking at the grace-period conditions, I suspect the cond_resched() after
each __napi_poll() is the problem: cond_resched() enters the scheduler with
the PREEMPT bit set, which does not count as a grace period for the tasks
RCU. Meanwhile, since the thread is rescheduled so frequently, the normal
scheduling point (no PREEMPT bit, counted as a tasks RCU grace period) seems
to have very little chance to kick in. Given the nature of the "busy polling"
workload, such a NAPI thread also won't have task->nvcsw or task->on_rq
updated (the other grace-period conditions), so the thread stays on the RCU
holdout list for an indefinitely long time.
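
For illustration only, the holdout conditions described above can be
paraphrased as the simplified sketch below; this is not the actual
kernel/rcu/tasks.h code, just a restatement of the checks a busy threaded
NAPI poller never satisfies:

	/* Illustrative paraphrase of the tasks RCU holdout check, not the
	 * real implementation.
	 */
	static bool tasks_rcu_would_release(struct task_struct *t,
					    unsigned long nvcsw_snapshot)
	{
		/* Released if the task took a voluntary context switch... */
		if (t->nvcsw != nvcsw_snapshot)
			return true;
		/* ...or is not runnable at all, e.g. sleeping in napi_thread_wait. */
		if (!t->on_rq)
			return true;
		/* A poller that only reschedules via cond_resched() (a preemption)
		 * satisfies neither condition and remains a holdout.
		 */
		return false;
	}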

This is fixed simply by adopting behavior similar to ksoftirqd: after the
thread has repolled for a while, raise an RCU QS to help expedite the tasks
RCU grace period. No more blocking is observed afterwards.

Some brief iperf3 throughput testing in my VM with a net-next kernel shows no
notable perf difference with a 1500 byte MTU over 10 repeated runs each:

Before:
UDP:  3.073Gbps (+-0.070Gbps)
TCP: 37.850Gbps (+-1.947Gbps)

After:
UDP:  3.077Gbps (+-0.121 Gbps)
TCP: 38.120Gbps (+-2.272 Gbps)

Note I didn't enable GRO for UDP, so its throughput is lower than TCP's.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
---
v1->v2: moved rcu_softirq_qs out of the bh critical section, and only raise
it after a second of repolling. Added some brief perf test results.

---
 net/core/dev.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Comments

Yan Zhai March 1, 2024, 3:49 a.m. UTC | #1
On Thu, Feb 29, 2024 at 9:47 PM Yan Zhai <yan@cloudflare.com> wrote:
>
> We noticed task RCUs being blocked when threaded NAPIs are very busy at
> workloads: detaching any BPF tracing programs, i.e. removing a ftrace
> trampoline, will simply block for very long in rcu_tasks_wait_gp. This
> ranges from hundreds of seconds to even an hour, severely harming any
> observability tools that rely on BPF tracing programs. It can be
> easily reproduced locally with following setup:
>
> ip netns add test1
> ip netns add test2
>
> ip -n test1 link add veth1 type veth peer name veth2 netns test2
>
> ip -n test1 link set veth1 up
> ip -n test1 link set lo up
> ip -n test2 link set veth2 up
> ip -n test2 link set lo up
>
> ip -n test1 addr add 192.168.1.2/31 dev veth1
> ip -n test1 addr add 1.1.1.1/32 dev lo
> ip -n test2 addr add 192.168.1.3/31 dev veth2
> ip -n test2 addr add 2.2.2.2/31 dev lo
>
> ip -n test1 route add default via 192.168.1.3
> ip -n test2 route add default via 192.168.1.2
>
> for i in `seq 10 210`; do
>  for j in `seq 10 210`; do
>     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
>  done
> done
>
> ip netns exec test2 ethtool -K veth2 gro on
> ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
> ip netns exec test1 ethtool -K veth1 tso off
>
> Then run an iperf3 client/server and a bpftrace script can trigger it:
>
> ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
> ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
> bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
>
> Above reproduce for net-next kernel with following RCU and preempt
> configuraitons:
>
> # RCU Subsystem
> CONFIG_TREE_RCU=y
> CONFIG_PREEMPT_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU_GENERIC=y
> CONFIG_TASKS_RCU=y
> CONFIG_TASKS_RUDE_RCU=y
> CONFIG_TASKS_TRACE_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> # end of RCU Subsystem
> # RCU Debugging
> # CONFIG_RCU_SCALE_TEST is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_RCU_REF_SCALE_TEST is not set
> CONFIG_RCU_CPU_STALL_TIMEOUT=21
> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> # CONFIG_RCU_TRACE is not set
> # CONFIG_RCU_EQS_DEBUG is not set
> # end of RCU Debugging
>
> CONFIG_PREEMPT_BUILD=y
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set
> CONFIG_PREEMPT_COUNT=y
> CONFIG_PREEMPTION=y
> CONFIG_PREEMPT_DYNAMIC=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_HAVE_PREEMPT_DYNAMIC=y
> CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_DEBUG_PREEMPT is not set
> # CONFIG_PREEMPT_TRACER is not set
> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
>
> An interesting observation is that, while tasks RCUs are blocked,
> related NAPI thread is still being scheduled (even across cores)
> regularly. Looking at the gp conditions, I am inclining to cond_resched
> after each __napi_poll being the problem: cond_resched enters the
> scheduler with PREEMPT bit, which does not account as a gp for tasks
> RCUs. Meanwhile, since the thread has been frequently resched, the
> normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
> seems to have very little chance to kick in. Given the nature of "busy
> polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
> updated (other gp conditions), the result is that such NAPI thread is
> put on RCU holdouts list for indefinitely long time.
>
> This is simply fixed by adapting similar behavior of ksoftirqd: after
> the thread repolls for a while, raise a RCU QS to help expedite the
> tasks RCU grace period. No more blocking afterwards.
>
> Some brief iperf3 throughput testing in my VM with net-next kernel shows no
> noteable perf difference with 1500 byte MTU for 10 repeat runs each:
>
> Before:
> UDP:  3.073Gbps (+-0.070Gbps)
> TCP: 37.850Gbps (+-1.947Gbps)
>
> After:
> UDP:  3.077Gbps (+-0.121 Gbps)
> TCP: 38.120Gbps (+-2.272 Gbps)
>
> Note I didn't enable GRO for UDP so its throughput is lower than TCP.
>
> Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Yan Zhai <yan@cloudflare.com>
> ---
> v1->v2: moved rcu_softirq_qs out from bh critical section, and only
> raise it after a second of repolling. Added some brief perf test result.
>
Link to v1: https://lore.kernel.org/netdev/Zd4DXTyCf17lcTfq@debian.debian/T/#u
And I apparently forgot to update the subject, since v2 no longer raises a
QS after every poll (let me know if it is preferred to send a v3 to fix
that).

thanks
Yan

> ---
>  net/core/dev.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 275fd5259a4a..76cff3849e1f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6751,9 +6751,12 @@ static int napi_threaded_poll(void *data)
>  {
>         struct napi_struct *napi = data;
>         struct softnet_data *sd;
> +       unsigned long next_qs;
>         void *have;
>
>         while (!napi_thread_wait(napi)) {
> +               next_qs = jiffies + HZ;
> +
>                 for (;;) {
>                         bool repoll = false;
>
> @@ -6778,6 +6781,21 @@ static int napi_threaded_poll(void *data)
>                         if (!repoll)
>                                 break;
>
> +                       /* cond_resched cannot unblock tasks RCU writers, so it
> +                        * is necessary to relax periodically and raise a QS to
> +                        * avoid starving writers under frequent repoll, e.g.
> +                        * ftrace trampoline clean up work. When not repoll,
> +                        * napi_thread_wait will enter sleep and have the same
> +                        * QS effect.
> +                        */
> +                       if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
> +                           time_after(jiffies, next_qs)) {
> +                               preempt_disable();
> +                               rcu_softirq_qs();
> +                               preempt_enable();
> +                               next_qs = jiffies + HZ;
> +                       }
> +
>                         cond_resched();
>                 }
>         }
> --
> 2.30.2
>
Eric Dumazet March 1, 2024, 8:30 a.m. UTC | #2
On Fri, Mar 1, 2024 at 4:50 AM Yan Zhai <yan@cloudflare.com> wrote:
>
> On Thu, Feb 29, 2024 at 9:47 PM Yan Zhai <yan@cloudflare.com> wrote:
> >
> > We noticed task RCUs being blocked when threaded NAPIs are very busy at
> > workloads: detaching any BPF tracing programs, i.e. removing a ftrace
> > trampoline, will simply block for very long in rcu_tasks_wait_gp. This
> > ranges from hundreds of seconds to even an hour, severely harming any
...
> >
> > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> > ---
> > v1->v2: moved rcu_softirq_qs out from bh critical section, and only
> > raise it after a second of repolling. Added some brief perf test result.
> >
> Link to v1: https://lore.kernel.org/netdev/Zd4DXTyCf17lcTfq@debian.debian/T/#u
> And I apparently forgot to rename the subject since it's not raising
> after every poll (let me know if it is prefered to send a V3 to fix
> it)
>

I could not see the reason for 1sec (HZ) delays.

Would calling rcu_softirq_qs() every ~10ms instead be a serious issue ?

In any case, if this is all about rcu_tasks, I would prefer using a macro
defined in kernel/rcu/tasks.h
instead of having a hidden constant in a networking core function.

Thanks.
Yan Zhai March 1, 2024, 5:30 p.m. UTC | #3
Hi Eric,

On Fri, Mar 1, 2024 at 2:30 AM Eric Dumazet <edumazet@google.com> wrote:
>
> I could not see the reason for 1sec (HZ) delays.
>
> Would calling rcu_softirq_qs() every ~10ms instead be a serious issue ?
>
The trouble scenarios usually arise when we need to detach an ad-hoc BPF
tracing program, or restart a monitoring service. It is fine as long as
these do not block for 10+ seconds or completely stall under heavy traffic.
Raising a QS every few ms or every HZ both work in such cases.

> In anycase, if this all about rcu_tasks, I would prefer using a macro
> defined in kernel/rcu/tasks.h
> instead of having a hidden constant in a networking core function.

Paul E. McKenney was suggesting either the current form or

         local_bh_enable();
         if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                 rcu_softirq_qs_enable(local_bh_enable());
         else
                 local_bh_enable();

With an interval it might have to be
"rcu_softirq_qs_enable(local_bh_enable(), &next_qs);" to avoid an
unnecessary extern/static var. Would that make more sense to you?

thanks

>
> Thanks.
Paul E. McKenney March 1, 2024, 10:29 p.m. UTC | #4
On Fri, Mar 01, 2024 at 11:30:29AM -0600, Yan Zhai wrote:
> Hi Eric,
> 
> On Fri, Mar 1, 2024 at 2:30 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > I could not see the reason for 1sec (HZ) delays.
> >
> > Would calling rcu_softirq_qs() every ~10ms instead be a serious issue ?
> >
> The trouble scenarios are often when we need to detach an ad-hoc BPF
> tracing program, or restart a monitoring service. It is fine as long
> as they do not block for 10+ seconds or even completely stall under
> heavy traffic. Raising a QS every few ms or HZ both work in such
> cases.
> 
> > In anycase, if this all about rcu_tasks, I would prefer using a macro
> > defined in kernel/rcu/tasks.h
> > instead of having a hidden constant in a networking core function.
> 
> Paul E. McKenney was suggesting either current form or
> 
>          local_bh_enable();
>          if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>                  rcu_softirq_qs_enable(local_bh_enable());
>          else
>                  local_bh_enable();
> 
> With an interval it might have to be
> "rcu_softirq_qs_enable(local_bh_enable(), &next_qs);" to avoid an
> unnecessary extern/static var. Will it make more sense to you?

I was thinking in terms of something like this (untested):

	#define rcu_softirq_qs_enable(enable_stmt, oldj) \
	do { \
		if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
		    time_after(jiffies, (oldj) + HZ / 10)) { \
			rcu_softirq_qs(); \
			(oldj) = jiffies; \
		} \
		do { enable_stmt; } while (0); \
	} while (0)

Then the call could be "rcu_softirq_qs_enable(local_bh_enable(), last_qs)",
where last_qs is initialized by the caller to jiffies.

The reason for putting "enable_stmt;" into another do-while loop is
in case someone typos an "else" as the first part of the "enable_stmt"
argument.

Would that work?

							Thanx, Paul
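
For concreteness, the call Paul describes,
"rcu_softirq_qs_enable(local_bh_enable(), last_qs)", could sit in
napi_threaded_poll() roughly as sketched below. This is an illustration
only: rcu_softirq_qs_enable() is the macro proposed above, not an existing
kernel API, and the body of the poll loop is abbreviated.

	static int napi_threaded_poll(void *data)
	{
		struct napi_struct *napi = data;
		/* last_qs is initialized by the caller to jiffies, per above */
		unsigned long last_qs = jiffies;

		while (!napi_thread_wait(napi)) {
			for (;;) {
				bool repoll = false;

				local_bh_disable();
				/* ... __napi_poll(napi, &repoll) and friends ... */

				/* Re-enable bh; when more than HZ/10 jiffies have
				 * passed since the last QS, also report one so
				 * tasks RCU writers are not starved by repolling.
				 */
				rcu_softirq_qs_enable(local_bh_enable(), last_qs);

				if (!repoll)
					break;

				cond_resched();
			}
		}
		return 0;
	}
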
Yan Zhai March 11, 2024, 10:58 p.m. UTC | #5
On Fri, Mar 1, 2024 at 4:29 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Mar 01, 2024 at 11:30:29AM -0600, Yan Zhai wrote:
> > Hi Eric,
> >
> > On Fri, Mar 1, 2024 at 2:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > I could not see the reason for 1sec (HZ) delays.
> > >
> > > Would calling rcu_softirq_qs() every ~10ms instead be a serious issue ?
> > >
> > The trouble scenarios are often when we need to detach an ad-hoc BPF
> > tracing program, or restart a monitoring service. It is fine as long
> > as they do not block for 10+ seconds or even completely stall under
> > heavy traffic. Raising a QS every few ms or HZ both work in such
> > cases.
> >
> > > In anycase, if this all about rcu_tasks, I would prefer using a macro
> > > defined in kernel/rcu/tasks.h
> > > instead of having a hidden constant in a networking core function.
> >
> > Paul E. McKenney was suggesting either current form or
> >
> >          local_bh_enable();
> >          if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> >                  rcu_softirq_qs_enable(local_bh_enable());
> >          else
> >                  local_bh_enable();
> >
> > With an interval it might have to be
> > "rcu_softirq_qs_enable(local_bh_enable(), &next_qs);" to avoid an
> > unnecessary extern/static var. Will it make more sense to you?
>
> I was thinking in terms of something like this (untested):
>
>         #define rcu_softirq_qs_enable(enable_stmt, oldj) \
>         do { \
>                 if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
>                     time_after(jiffies, (oldj) + HZ / 10)) { \
>                         rcu_softirq_qs(); \
>                         (oldj) = jiffies; \
>                 } \
>                 do { enable_stmt; } while (0); \
>         } while (0)
>
> Then the call could be "rcu_softirq_qs_enable(local_bh_enable(), last_qs)",
> where last_qs is initialized by the caller to jiffies.
>
> The reason for putting "enable_stmt;" into anothor do-while loop is
> in case someone typos an "else" as the first part of the "enable_stmt"
> argument.
>
> Would that work?
>
Thanks Paul, I just got time to continue this thread as I was travelling. I
think it is probably better to move preempt_disable/enable into the macro to
avoid the friction. And since this can also affect the NAPI thread, the NAPI
busy loop and the XDP cpu map thread (+Jesper, who reminded me about this),
let me send a v3 later to cover all of those places.

Yan


>                                                         Thanx, Paul
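
Folding the preempt_disable()/preempt_enable() pair into the helper, as Yan
suggests here, could end up looking something like the sketch below. The
name and placement are assumptions for illustration only (no such macro
exists in the tree at this point), and dropping the enable_stmt argument is
one way to keep it usable from the NAPI busy loop and the XDP cpu map
kthread as well, where the bh-enable point differs.

	/* Sketch only: report a QS at most every HZ/10 jiffies from a busy
	 * kthread loop; the name and location are hypothetical.
	 */
	#define rcu_softirq_qs_periodic(old_ts)				\
	do {								\
		if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&			\
		    time_after(jiffies, (old_ts) + HZ / 10)) {		\
			preempt_disable();				\
			rcu_softirq_qs();				\
			preempt_enable();				\
			(old_ts) = jiffies;				\
		}							\
	} while (0)

A caller like napi_threaded_poll() would then simply invoke it next to its
local_bh_enable(), e.g. rcu_softirq_qs_periodic(last_qs), with last_qs
initialized to jiffies.
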
Paul E. McKenney March 11, 2024, 11:55 p.m. UTC | #6
On Mon, Mar 11, 2024 at 05:58:16PM -0500, Yan Zhai wrote:
> On Fri, Mar 1, 2024 at 4:29 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Mar 01, 2024 at 11:30:29AM -0600, Yan Zhai wrote:
> > > Hi Eric,
> > >
> > > On Fri, Mar 1, 2024 at 2:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > I could not see the reason for 1sec (HZ) delays.
> > > >
> > > > Would calling rcu_softirq_qs() every ~10ms instead be a serious issue ?
> > > >
> > > The trouble scenarios are often when we need to detach an ad-hoc BPF
> > > tracing program, or restart a monitoring service. It is fine as long
> > > as they do not block for 10+ seconds or even completely stall under
> > > heavy traffic. Raising a QS every few ms or HZ both work in such
> > > cases.
> > >
> > > > In anycase, if this all about rcu_tasks, I would prefer using a macro
> > > > defined in kernel/rcu/tasks.h
> > > > instead of having a hidden constant in a networking core function.
> > >
> > > Paul E. McKenney was suggesting either current form or
> > >
> > >          local_bh_enable();
> > >          if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > >                  rcu_softirq_qs_enable(local_bh_enable());
> > >          else
> > >                  local_bh_enable();
> > >
> > > With an interval it might have to be
> > > "rcu_softirq_qs_enable(local_bh_enable(), &next_qs);" to avoid an
> > > unnecessary extern/static var. Will it make more sense to you?
> >
> > I was thinking in terms of something like this (untested):
> >
> >         #define rcu_softirq_qs_enable(enable_stmt, oldj) \
> >         do { \
> >                 if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> >                     time_after(jiffies, (oldj) + HZ / 10)) { \
> >                         rcu_softirq_qs(); \
> >                         (oldj) = jiffies; \
> >                 } \
> >                 do { enable_stmt; } while (0); \
> >         } while (0)
> >
> > Then the call could be "rcu_softirq_qs_enable(local_bh_enable(), last_qs)",
> > where last_qs is initialized by the caller to jiffies.
> >
> > The reason for putting "enable_stmt;" into anothor do-while loop is
> > in case someone typos an "else" as the first part of the "enable_stmt"
> > argument.
> >
> > Would that work?
> >
> Thanks Paul, just got time to continue this thread as I was
> travelling. I think it is probably better to move
> preempt_disable/enable into the macro to avoid the friction. And also
> since this can affect NAPI thread, NAPI busy loop and XDP cpu map
> thread (+Jesper who reminded me about this), let me send a v3 later to
> cover all of those places.

OK, looking forward to seeing what you come up with.

							Thanx, Paul

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 275fd5259a4a..76cff3849e1f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6751,9 +6751,12 @@  static int napi_threaded_poll(void *data)
 {
 	struct napi_struct *napi = data;
 	struct softnet_data *sd;
+	unsigned long next_qs;
 	void *have;
 
 	while (!napi_thread_wait(napi)) {
+		next_qs = jiffies + HZ;
+
 		for (;;) {
 			bool repoll = false;
 
@@ -6778,6 +6781,21 @@  static int napi_threaded_poll(void *data)
 			if (!repoll)
 				break;
 
+			/* cond_resched cannot unblock tasks RCU writers, so it
+			 * is necessary to relax periodically and raise a QS to
+			 * avoid starving writers under frequent repoll, e.g.
+			 * ftrace trampoline clean up work. When not repoll,
+			 * napi_thread_wait will enter sleep and have the same
+			 * QS effect.
+			 */
+			if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
+			    time_after(jiffies, next_qs)) {
+				preempt_disable();
+				rcu_softirq_qs();
+				preempt_enable();
+				next_qs = jiffies + HZ;
+			}
+
 			cond_resched();
 		}
 	}