| Message ID | ZeFPz4D121TgvCje@debian.debian (mailing list archive) |
|---|---|
| State | Changes Requested |
| Delegated to | Netdev Maintainers |
| Series | [v2] net: raise RCU qs after each threaded NAPI poll |
On Thu, Feb 29, 2024 at 9:47 PM Yan Zhai <yan@cloudflare.com> wrote:
>
> We noticed task RCUs being blocked when threaded NAPIs are very busy at
> workloads: detaching any BPF tracing programs, i.e. removing a ftrace
> trampoline, will simply block for very long in rcu_tasks_wait_gp. This
> ranges from hundreds of seconds to even an hour, severely harming any
> observability tools that rely on BPF tracing programs. It can be
> easily reproduced locally with the following setup:
>
> ip netns add test1
> ip netns add test2
>
> ip -n test1 link add veth1 type veth peer name veth2 netns test2
>
> ip -n test1 link set veth1 up
> ip -n test1 link set lo up
> ip -n test2 link set veth2 up
> ip -n test2 link set lo up
>
> ip -n test1 addr add 192.168.1.2/31 dev veth1
> ip -n test1 addr add 1.1.1.1/32 dev lo
> ip -n test2 addr add 192.168.1.3/31 dev veth2
> ip -n test2 addr add 2.2.2.2/31 dev lo
>
> ip -n test1 route add default via 192.168.1.3
> ip -n test2 route add default via 192.168.1.2
>
> for i in `seq 10 210`; do
>   for j in `seq 10 210`; do
>     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
>   done
> done
>
> ip netns exec test2 ethtool -K veth2 gro on
> ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
> ip netns exec test1 ethtool -K veth1 tso off
>
> Then run an iperf3 client/server and a bpftrace script can trigger it:
>
> ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null &
> ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null &
> bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
>
> The above reproduces on a net-next kernel with the following RCU and
> preempt configurations:
>
> # RCU Subsystem
> CONFIG_TREE_RCU=y
> CONFIG_PREEMPT_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU_GENERIC=y
> CONFIG_TASKS_RCU=y
> CONFIG_TASKS_RUDE_RCU=y
> CONFIG_TASKS_TRACE_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> # end of RCU Subsystem
>
> # RCU Debugging
> # CONFIG_RCU_SCALE_TEST is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_RCU_REF_SCALE_TEST is not set
> CONFIG_RCU_CPU_STALL_TIMEOUT=21
> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> # CONFIG_RCU_TRACE is not set
> # CONFIG_RCU_EQS_DEBUG is not set
> # end of RCU Debugging
>
> CONFIG_PREEMPT_BUILD=y
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set
> CONFIG_PREEMPT_COUNT=y
> CONFIG_PREEMPTION=y
> CONFIG_PREEMPT_DYNAMIC=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_HAVE_PREEMPT_DYNAMIC=y
> CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_DEBUG_PREEMPT is not set
> # CONFIG_PREEMPT_TRACER is not set
> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
>
> An interesting observation is that, while tasks RCUs are blocked, the
> related NAPI thread is still being scheduled (even across cores)
> regularly. Looking at the gp conditions, I am inclined to think the
> cond_resched after each __napi_poll is the problem: cond_resched enters
> the scheduler with the PREEMPT bit set, which does not count as a gp
> for tasks RCUs. Meanwhile, since the thread has been frequently
> rescheduled, the normal scheduling point (no PREEMPT bit, counted as a
> tasks RCU gp) seems to have very little chance to kick in. Given the
> nature of a "busy polling" program, such a NAPI thread won't have
> task->nvcsw or task->on_rq updated (the other gp conditions) either, so
> the thread is put on the RCU holdout list for an indefinitely long
> time.
>
> This is fixed by adapting the behavior of ksoftirqd: after the thread
> repolls for a while, raise an RCU QS to help expedite the tasks RCU
> grace period. No more blocking afterwards.
>
> Some brief iperf3 throughput testing in my VM with a net-next kernel
> shows no notable perf difference with 1500 byte MTU for 10 repeat runs
> each:
>
> Before:
> UDP: 3.073 Gbps (+-0.070 Gbps)
> TCP: 37.850 Gbps (+-1.947 Gbps)
>
> After:
> UDP: 3.077 Gbps (+-0.121 Gbps)
> TCP: 38.120 Gbps (+-2.272 Gbps)
>
> Note I didn't enable GRO for UDP, so its throughput is lower than TCP.
>
> Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Yan Zhai <yan@cloudflare.com>
> ---
> v1->v2: moved rcu_softirq_qs out from bh critical section, and only
> raise it after a second of repolling. Added some brief perf test result.
>
Link to v1: https://lore.kernel.org/netdev/Zd4DXTyCf17lcTfq@debian.debian/T/#u
And I apparently forgot to rename the subject since it's not raising
after every poll (let me know if it is preferred to send a V3 to fix it).

thanks
Yan

> ---
>  net/core/dev.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 275fd5259a4a..76cff3849e1f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6751,9 +6751,12 @@ static int napi_threaded_poll(void *data)
>  {
>  	struct napi_struct *napi = data;
>  	struct softnet_data *sd;
> +	unsigned long next_qs;
>  	void *have;
>
>  	while (!napi_thread_wait(napi)) {
> +		next_qs = jiffies + HZ;
> +
>  		for (;;) {
>  			bool repoll = false;
>
> @@ -6778,6 +6781,21 @@ static int napi_threaded_poll(void *data)
>  			if (!repoll)
>  				break;
>
> +			/* cond_resched cannot unblock tasks RCU writers, so it
> +			 * is necessary to relax periodically and raise a QS to
> +			 * avoid starving writers under frequent repoll, e.g.
> +			 * ftrace trampoline clean up work. When not repoll,
> +			 * napi_thread_wait will enter sleep and have the same
> +			 * QS effect.
> +			 */
> +			if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
> +			    time_after(jiffies, next_qs)) {
> +				preempt_disable();
> +				rcu_softirq_qs();
> +				preempt_enable();
> +				next_qs = jiffies + HZ;
> +			}
> +
>  			cond_resched();
>  		}
>  	}
> --
> 2.30.2
>
On Fri, Mar 1, 2024 at 4:50 AM Yan Zhai <yan@cloudflare.com> wrote:
>
> On Thu, Feb 29, 2024 at 9:47 PM Yan Zhai <yan@cloudflare.com> wrote:
> >
> > We noticed task RCUs being blocked when threaded NAPIs are very busy at
> > workloads: detaching any BPF tracing programs, i.e. removing a ftrace
> > trampoline, will simply block for very long in rcu_tasks_wait_gp. This
> > ranges from hundreds of seconds to even an hour, severely harming any
...
> >
> > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> > ---
> > v1->v2: moved rcu_softirq_qs out from bh critical section, and only
> > raise it after a second of repolling. Added some brief perf test result.
> >
> Link to v1: https://lore.kernel.org/netdev/Zd4DXTyCf17lcTfq@debian.debian/T/#u
> And I apparently forgot to rename the subject since it's not raising
> after every poll (let me know if it is preferred to send a V3 to fix
> it)
>
I could not see the reason for 1 sec (HZ) delays.

Would calling rcu_softirq_qs() every ~10ms instead be a serious issue?

In any case, if this is all about rcu_tasks, I would prefer using a
macro defined in kernel/rcu/tasks.h instead of having a hidden constant
in a networking core function.

Thanks.
Hi Eric,

On Fri, Mar 1, 2024 at 2:30 AM Eric Dumazet <edumazet@google.com> wrote:
>
> I could not see the reason for 1 sec (HZ) delays.
>
> Would calling rcu_softirq_qs() every ~10ms instead be a serious issue?
>
The trouble scenarios are often when we need to detach an ad-hoc BPF
tracing program, or restart a monitoring service. It is fine as long as
they do not block for 10+ seconds or even completely stall under heavy
traffic. Raising a QS every few ms or every HZ both work in such cases.

> In any case, if this is all about rcu_tasks, I would prefer using a
> macro defined in kernel/rcu/tasks.h instead of having a hidden constant
> in a networking core function.

Paul E. McKenney was suggesting either the current form or

local_bh_enable();
if (!IS_ENABLED(CONFIG_PREEMPT_RT))
	rcu_softirq_qs_enable(local_bh_enable());
else
	local_bh_enable();

With an interval it might have to be
"rcu_softirq_qs_enable(local_bh_enable(), &next_qs);" to avoid an
unnecessary extern/static var. Will it make more sense to you?

thanks

>
> Thanks.
On Fri, Mar 01, 2024 at 11:30:29AM -0600, Yan Zhai wrote:
> Hi Eric,
>
> On Fri, Mar 1, 2024 at 2:30 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > I could not see the reason for 1 sec (HZ) delays.
> >
> > Would calling rcu_softirq_qs() every ~10ms instead be a serious issue?
> >
> The trouble scenarios are often when we need to detach an ad-hoc BPF
> tracing program, or restart a monitoring service. It is fine as long as
> they do not block for 10+ seconds or even completely stall under heavy
> traffic. Raising a QS every few ms or every HZ both work in such cases.
>
> > In any case, if this is all about rcu_tasks, I would prefer using a
> > macro defined in kernel/rcu/tasks.h instead of having a hidden constant
> > in a networking core function.
>
> Paul E. McKenney was suggesting either the current form or
>
> local_bh_enable();
> if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> 	rcu_softirq_qs_enable(local_bh_enable());
> else
> 	local_bh_enable();
>
> With an interval it might have to be
> "rcu_softirq_qs_enable(local_bh_enable(), &next_qs);" to avoid an
> unnecessary extern/static var. Will it make more sense to you?

I was thinking in terms of something like this (untested):

#define rcu_softirq_qs_enable(enable_stmt, oldj)		\
do {								\
	if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&			\
	    time_after(oldj + HZ / 10, jiffies)) {		\
		rcu_softirq_qs();				\
		(oldj) = jiffies;				\
	}							\
	do { enable_stmt; } while (0);				\
} while (0)

Then the call could be "rcu_softirq_qs_enable(local_bh_enable(), last_qs)",
where last_qs is initialized by the caller to jiffies.

The reason for putting "enable_stmt;" into another do-while loop is
in case someone typos an "else" as the first part of the "enable_stmt"
argument.

Would that work?

							Thanx, Paul
On Fri, Mar 1, 2024 at 4:29 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Mar 01, 2024 at 11:30:29AM -0600, Yan Zhai wrote:
> > Hi Eric,
> >
> > On Fri, Mar 1, 2024 at 2:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > I could not see the reason for 1 sec (HZ) delays.
> > >
> > > Would calling rcu_softirq_qs() every ~10ms instead be a serious issue?
> > >
> > The trouble scenarios are often when we need to detach an ad-hoc BPF
> > tracing program, or restart a monitoring service. It is fine as long as
> > they do not block for 10+ seconds or even completely stall under heavy
> > traffic. Raising a QS every few ms or every HZ both work in such cases.
> >
> > > In any case, if this is all about rcu_tasks, I would prefer using a
> > > macro defined in kernel/rcu/tasks.h instead of having a hidden constant
> > > in a networking core function.
> >
> > Paul E. McKenney was suggesting either the current form or
> >
> > local_bh_enable();
> > if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > 	rcu_softirq_qs_enable(local_bh_enable());
> > else
> > 	local_bh_enable();
> >
> > With an interval it might have to be
> > "rcu_softirq_qs_enable(local_bh_enable(), &next_qs);" to avoid an
> > unnecessary extern/static var. Will it make more sense to you?
>
> I was thinking in terms of something like this (untested):
>
> #define rcu_softirq_qs_enable(enable_stmt, oldj)		\
> do {								\
> 	if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&			\
> 	    time_after(oldj + HZ / 10, jiffies)) {		\
> 		rcu_softirq_qs();				\
> 		(oldj) = jiffies;				\
> 	}							\
> 	do { enable_stmt; } while (0);				\
> } while (0)
>
> Then the call could be "rcu_softirq_qs_enable(local_bh_enable(), last_qs)",
> where last_qs is initialized by the caller to jiffies.
>
> The reason for putting "enable_stmt;" into another do-while loop is
> in case someone typos an "else" as the first part of the "enable_stmt"
> argument.
>
> Would that work?
>
Thanks Paul, just got time to continue this thread as I was travelling.
I think it is probably better to move preempt_disable/enable into the
macro to avoid the friction. And also since this can affect the NAPI
thread, the NAPI busy loop, and the XDP cpumap thread (+Jesper who
reminded me about this), let me send a v3 later to cover all of those
places.

Yan

>							Thanx, Paul
On Mon, Mar 11, 2024 at 05:58:16PM -0500, Yan Zhai wrote:
> On Fri, Mar 1, 2024 at 4:29 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Mar 01, 2024 at 11:30:29AM -0600, Yan Zhai wrote:
> > > Hi Eric,
> > >
> > > On Fri, Mar 1, 2024 at 2:30 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > I could not see the reason for 1 sec (HZ) delays.
> > > >
> > > > Would calling rcu_softirq_qs() every ~10ms instead be a serious issue?
> > > >
> > > The trouble scenarios are often when we need to detach an ad-hoc BPF
> > > tracing program, or restart a monitoring service. It is fine as long as
> > > they do not block for 10+ seconds or even completely stall under heavy
> > > traffic. Raising a QS every few ms or every HZ both work in such cases.
> > >
> > > > In any case, if this is all about rcu_tasks, I would prefer using a
> > > > macro defined in kernel/rcu/tasks.h instead of having a hidden constant
> > > > in a networking core function.
> > >
> > > Paul E. McKenney was suggesting either the current form or
> > >
> > > local_bh_enable();
> > > if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > > 	rcu_softirq_qs_enable(local_bh_enable());
> > > else
> > > 	local_bh_enable();
> > >
> > > With an interval it might have to be
> > > "rcu_softirq_qs_enable(local_bh_enable(), &next_qs);" to avoid an
> > > unnecessary extern/static var. Will it make more sense to you?
> >
> > I was thinking in terms of something like this (untested):
> >
> > #define rcu_softirq_qs_enable(enable_stmt, oldj)		\
> > do {								\
> > 	if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&			\
> > 	    time_after(oldj + HZ / 10, jiffies)) {		\
> > 		rcu_softirq_qs();				\
> > 		(oldj) = jiffies;				\
> > 	}							\
> > 	do { enable_stmt; } while (0);				\
> > } while (0)
> >
> > Then the call could be "rcu_softirq_qs_enable(local_bh_enable(), last_qs)",
> > where last_qs is initialized by the caller to jiffies.
> >
> > The reason for putting "enable_stmt;" into another do-while loop is
> > in case someone typos an "else" as the first part of the "enable_stmt"
> > argument.
> >
> > Would that work?
> >
> Thanks Paul, just got time to continue this thread as I was
> travelling. I think it is probably better to move
> preempt_disable/enable into the macro to avoid the friction. And also
> since this can affect the NAPI thread, the NAPI busy loop, and the XDP
> cpumap thread (+Jesper who reminded me about this), let me send a v3
> later to cover all of those places.

OK, looking forward to seeing what you come up with.

							Thanx, Paul
diff --git a/net/core/dev.c b/net/core/dev.c
index 275fd5259a4a..76cff3849e1f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6751,9 +6751,12 @@ static int napi_threaded_poll(void *data)
 {
 	struct napi_struct *napi = data;
 	struct softnet_data *sd;
+	unsigned long next_qs;
 	void *have;
 
 	while (!napi_thread_wait(napi)) {
+		next_qs = jiffies + HZ;
+
 		for (;;) {
 			bool repoll = false;
 
@@ -6778,6 +6781,21 @@ static int napi_threaded_poll(void *data)
 			if (!repoll)
 				break;
 
+			/* cond_resched cannot unblock tasks RCU writers, so it
+			 * is necessary to relax periodically and raise a QS to
+			 * avoid starving writers under frequent repoll, e.g.
+			 * ftrace trampoline clean up work. When not repoll,
+			 * napi_thread_wait will enter sleep and have the same
+			 * QS effect.
+			 */
+			if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
+			    time_after(jiffies, next_qs)) {
+				preempt_disable();
+				rcu_softirq_qs();
+				preempt_enable();
+				next_qs = jiffies + HZ;
+			}
+
 			cond_resched();
 		}
 	}