[v5,net,1/3] rcu: add a helper to report consolidated flavor QS

Message ID 90431d46ee112d2b0af04dbfe936faaca11810a5.1710877680.git.yan@cloudflare.com (mailing list archive)
State Accepted
Commit 1a77557d48cff187a169c2aec01c0dd78a5e7e50
Delegated to: Netdev Maintainers
Series Report RCU QS for busy network kthreads

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 15258 this patch: 15258
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 7 maintainers not CCed: josh@joshtriplett.org boqun.feng@gmail.com jiangshanlai@gmail.com mathieu.desnoyers@efficios.com qiang.zhang1211@gmail.com frederic@kernel.org quic_neeraju@quicinc.com
netdev/build_clang success Errors and warnings before: 3081 this patch: 3081
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 16175 this patch: 16175
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 37 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-20--12-00 (tests: 910)

Commit Message

Yan Zhai March 19, 2024, 8:44 p.m. UTC
When under heavy load, network processing can run CPU-bound for many
tens of seconds. Even in preemptible kernels (non-RT kernel), this can
block RCU Tasks grace periods, which can cause trace-event removal to
take more than a minute, which is unacceptably long.

This commit therefore creates a new helper function that passes through
both RCU and RCU-Tasks quiescent states every 100 milliseconds. This
hard-coded value suffices for current workloads.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
---
v4->v5: adjusted kernel docs and commit message
v3->v4: kernel docs error

---
 include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
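
For orientation, a minimal sketch of the intended call pattern (illustrative
only: the real consumer is the threaded NAPI loop wired up later in this
series, and do_work() here is a hypothetical stand-in for one batch of
processing):

#include <linux/jiffies.h>
#include <linux/kthread.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

extern void do_work(void);	/* hypothetical work item */

static int busy_poll_fn(void *arg)
{
	unsigned long last_qs = jiffies;	/* initialized once, at the start */

	while (!kthread_should_stop()) {
		do_work();
		/*
		 * BH and preemption are enabled here, as the kernel doc
		 * requires; this reports RCU and RCU-Tasks quiescent
		 * states at most once every 100 milliseconds.
		 */
		rcu_softirq_qs_periodic(last_qs);
		cond_resched();
	}
	return 0;
}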

Comments

Paul E. McKenney March 19, 2024, 9:31 p.m. UTC | #1
On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote:
> When under heavy load, network processing can run CPU-bound for many
> tens of seconds. Even in preemptible kernels (non-RT kernel), this can
> block RCU Tasks grace periods, which can cause trace-event removal to
> take more than a minute, which is unacceptably long.
> 
> This commit therefore creates a new helper function that passes through
> both RCU and RCU-Tasks quiescent states every 100 milliseconds. This
> hard-coded value suffices for current workloads.
> 
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Yan Zhai <yan@cloudflare.com>

If you would like me to take this one via -rcu, I would be happy to take
it.  If it would be easier for you to push these as a group through
networking:

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
> v4->v5: adjusted kernel docs and commit message
> v3->v4: kernel docs error
> 
> ---
>  include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 16f519914415..17d7ed5f3ae6 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -247,6 +247,37 @@ do { \
>  	cond_resched(); \
>  } while (0)
>  
> +/**
> + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
> + * @old_ts: jiffies at start of processing.
> + *
> + * This helper is for long-running softirq handlers, such as NAPI threads in
> + * networking. The caller should initialize the variable passed in as @old_ts
> + * at the beginning of the softirq handler. When invoked frequently, this macro
> + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will
> + * provide both RCU and RCU-Tasks quiescent states. Note that this macro
> + * modifies its old_ts argument.
> + *
> + * Because regions of code that have disabled softirq act as RCU read-side
> + * critical sections, this macro should be invoked with softirq (and
> + * preemption) enabled.
> + *
> + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> + * have more chance to invoke schedule() calls and provide necessary quiescent
> + * states. As a contrast, calling cond_resched() only won't achieve the same
> + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> + */
> +#define rcu_softirq_qs_periodic(old_ts) \
> +do { \
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> +	    time_after(jiffies, (old_ts) + HZ / 10)) { \
> +		preempt_disable(); \
> +		rcu_softirq_qs(); \
> +		preempt_enable(); \
> +		(old_ts) = jiffies; \
> +	} \
> +} while (0)
> +
>  /*
>   * Infrastructure to implement the synchronize_() primitives in
>   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> -- 
> 2.30.2
> 
>
Yan Zhai March 19, 2024, 10 p.m. UTC | #2
Hi Paul,

On Tue, Mar 19, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote:
> > When under heavy load, network processing can run CPU-bound for many
> > tens of seconds. Even in preemptible kernels (non-RT kernel), this can
> > block RCU Tasks grace periods, which can cause trace-event removal to
> > take more than a minute, which is unacceptably long.
> >
> > This commit therefore creates a new helper function that passes through
> > both RCU and RCU-Tasks quiescent states every 100 milliseconds. This
> > hard-coded value suffices for current workloads.
> >
> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> > Signed-off-by: Yan Zhai <yan@cloudflare.com>
>
> If you would like me to take this one via -rcu, I would be happy to take
> it.  If it would be easier for you to push these as a group through
> networking:
>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
>

Since the whole series aims at fixing net problems, going through net
is probably more consistent.
Also, thank you for your help through the series!

Yan

> > ---
> > v4->v5: adjusted kernel docs and commit message
> > v3->v4: kernel docs error
> >
> > ---
> >  include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++
> >  1 file changed, 31 insertions(+)
> >
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 16f519914415..17d7ed5f3ae6 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -247,6 +247,37 @@ do { \
> >       cond_resched(); \
> >  } while (0)
> >
> > +/**
> > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
> > + * @old_ts: jiffies at start of processing.
> > + *
> > + * This helper is for long-running softirq handlers, such as NAPI threads in
> > + * networking. The caller should initialize the variable passed in as @old_ts
> > + * at the beginning of the softirq handler. When invoked frequently, this macro
> > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will
> > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro
> > + * modifies its old_ts argument.
> > + *
> > + * Because regions of code that have disabled softirq act as RCU read-side
> > + * critical sections, this macro should be invoked with softirq (and
> > + * preemption) enabled.
> > + *
> > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> > + * have more chance to invoke schedule() calls and provide necessary quiescent
> > + * states. As a contrast, calling cond_resched() only won't achieve the same
> > + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> > + */
> > +#define rcu_softirq_qs_periodic(old_ts) \
> > +do { \
> > +     if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> > +         time_after(jiffies, (old_ts) + HZ / 10)) { \
> > +             preempt_disable(); \
> > +             rcu_softirq_qs(); \
> > +             preempt_enable(); \
> > +             (old_ts) = jiffies; \
> > +     } \
> > +} while (0)
> > +
> >  /*
> >   * Infrastructure to implement the synchronize_() primitives in
> >   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> > --
> > 2.30.2
> >
> >
Paul E. McKenney March 19, 2024, 10:08 p.m. UTC | #3
On Tue, Mar 19, 2024 at 05:00:24PM -0500, Yan Zhai wrote:
> Hi Paul,
> 
> On Tue, Mar 19, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote:
> > > When under heavy load, network processing can run CPU-bound for many
> > > tens of seconds. Even in preemptible kernels (non-RT kernel), this can
> > > block RCU Tasks grace periods, which can cause trace-event removal to
> > > take more than a minute, which is unacceptably long.
> > >
> > > This commit therefore creates a new helper function that passes through
> > > both RCU and RCU-Tasks quiescent states every 100 milliseconds. This
> > > hard-coded value suffices for current workloads.
> > >
> > > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> > > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> >
> > If you would like me to take this one via -rcu, I would be happy to take
> > it.  If it would be easier for you to push these as a group through
> > networking:
> >
> > Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
> 
> Since the whole series aims at fixing net problems, going through net
> is probably more consistent.

Very good!  I will let you push it along.

> Also, thank you for your help through the series!

No, thank you!  I had just been asked to find this slowdown when you
posted the patch.  So it worked out extremely well for me!  ;-)

							Thanx, Paul

> Yan
> 
> > > ---
> > > v4->v5: adjusted kernel docs and commit message
> > > v3->v4: kernel docs error
> > >
> > > ---
> > >  include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++
> > >  1 file changed, 31 insertions(+)
> > >
> > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > index 16f519914415..17d7ed5f3ae6 100644
> > > --- a/include/linux/rcupdate.h
> > > +++ b/include/linux/rcupdate.h
> > > @@ -247,6 +247,37 @@ do { \
> > >       cond_resched(); \
> > >  } while (0)
> > >
> > > +/**
> > > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
> > > + * @old_ts: jiffies at start of processing.
> > > + *
> > > + * This helper is for long-running softirq handlers, such as NAPI threads in
> > > + * networking. The caller should initialize the variable passed in as @old_ts
> > > + * at the beginning of the softirq handler. When invoked frequently, this macro
> > > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will
> > > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro
> > > + * modifies its old_ts argument.
> > > + *
> > > + * Because regions of code that have disabled softirq act as RCU read-side
> > > + * critical sections, this macro should be invoked with softirq (and
> > > + * preemption) enabled.
> > > + *
> > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> > > + * have more chance to invoke schedule() calls and provide necessary quiescent
> > > + * states. As a contrast, calling cond_resched() only won't achieve the same
> > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> > > + */
> > > +#define rcu_softirq_qs_periodic(old_ts) \
> > > +do { \
> > > +     if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> > > +         time_after(jiffies, (old_ts) + HZ / 10)) { \
> > > +             preempt_disable(); \
> > > +             rcu_softirq_qs(); \
> > > +             preempt_enable(); \
> > > +             (old_ts) = jiffies; \
> > > +     } \
> > > +} while (0)
> > > +
> > >  /*
> > >   * Infrastructure to implement the synchronize_() primitives in
> > >   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> > > --
> > > 2.30.2
> > >
> > >
Sebastian Andrzej Siewior March 22, 2024, 11:24 a.m. UTC | #4
On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> index 16f519914415..17d7ed5f3ae6 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -247,6 +247,37 @@ do { \
>  	cond_resched(); \
>  } while (0)
>  
> +/**
> + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
> + * @old_ts: jiffies at start of processing.
> + *
> + * This helper is for long-running softirq handlers, such as NAPI threads in
> + * networking. The caller should initialize the variable passed in as @old_ts
> + * at the beginning of the softirq handler. When invoked frequently, this macro
> + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will
> + * provide both RCU and RCU-Tasks quiescent states. Note that this macro
> + * modifies its old_ts argument.
> + *
> + * Because regions of code that have disabled softirq act as RCU read-side
> + * critical sections, this macro should be invoked with softirq (and
> + * preemption) enabled.
> + *
> + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> + * have more chance to invoke schedule() calls and provide necessary quiescent
> + * states. As a contrast, calling cond_resched() only won't achieve the same
> + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> + */

Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
Why does RT have more scheduling points?
The RCU-Tasks thread is starving and yet there is no wake-up, correct?
Shouldn't cond_resched() take care of RCU-Tasks's needs, too?
This function is used by napi_threaded_poll(), which is not invoked in
softirq context; it is a simple thread which does disable BH, but that is
it. Any pending softirqs are served before the cond_resched().

This napi_threaded_poll() case is _basically_ a busy thread doing a lot
of work and delaying RCU-Tasks as far as I understand. The same may
happen to other busy workers which have cond_resched() between work
items, such as the kworker. Therefore I would expect to have some kind
of timeout at which point NEED_RESCHED is set so that cond_resched()
can do its work.

> +#define rcu_softirq_qs_periodic(old_ts) \
> +do { \
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> +	    time_after(jiffies, (old_ts) + HZ / 10)) { \
> +		preempt_disable(); \
> +		rcu_softirq_qs(); \
> +		preempt_enable(); \
> +		(old_ts) = jiffies; \
> +	} \
> +} while (0)
> +
>  /*
>   * Infrastructure to implement the synchronize_() primitives in
>   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.

Sebastian
Paul E. McKenney March 22, 2024, 9:30 p.m. UTC | #5
On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> > index 16f519914415..17d7ed5f3ae6 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -247,6 +247,37 @@ do { \
> >  	cond_resched(); \
> >  } while (0)
> >  
> > +/**
> > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
> > + * @old_ts: jiffies at start of processing.
> > + *
> > + * This helper is for long-running softirq handlers, such as NAPI threads in
> > + * networking. The caller should initialize the variable passed in as @old_ts
> > + * at the beginning of the softirq handler. When invoked frequently, this macro
> > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will
> > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro
> > + * modifies its old_ts argument.
> > + *
> > + * Because regions of code that have disabled softirq act as RCU read-side
> > + * critical sections, this macro should be invoked with softirq (and
> > + * preemption) enabled.
> > + *
> > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> > + * have more chance to invoke schedule() calls and provide necessary quiescent
> > + * states. As a contrast, calling cond_resched() only won't achieve the same
> > + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> > + */
> 
> Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
> Why does RT have more scheduling points?

In RT, isn't BH-disabled code preemptible?  But yes, this would not help
RCU Tasks.

> The RCU-Tasks thread is starving and yet there is no wake-up, correct?
> Shouldn't cond_resched() take care of RCU-Tasks's needs, too?
> This function is used by napi_threaded_poll(), which is not invoked in
> softirq context; it is a simple thread which does disable BH, but that is
> it. Any pending softirqs are served before the cond_resched().
> 
> This napi_threaded_poll() case is _basically_ a busy thread doing a lot
> of work and delaying RCU-Tasks as far as I understand. The same may
> happen to other busy workers which have cond_resched() between work
> items, such as the kworker. Therefore I would expect to have some kind
> of timeout at which point NEED_RESCHED is set so that cond_resched()
> can do its work.

A NEED_RESCHED with a cond_resched() would still be counted as a
preemption.  If we were intending to keep cond_resched(), I would
be thinking in terms of changing that, but only for Tasks RCU.

Given no cond_resched(), I would be thinking in terms of removing
the check for CONFIG_PREEMPT_RT.

Thoughts?

							Thanx, Paul

> > +#define rcu_softirq_qs_periodic(old_ts) \
> > +do { \
> > +	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> > +	    time_after(jiffies, (old_ts) + HZ / 10)) { \
> > +		preempt_disable(); \
> > +		rcu_softirq_qs(); \
> > +		preempt_enable(); \
> > +		(old_ts) = jiffies; \
> > +	} \
> > +} while (0)
> > +
> >  /*
> >   * Infrastructure to implement the synchronize_() primitives in
> >   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> 
> Sebastian
Yan Zhai March 23, 2024, 2:02 a.m. UTC | #6
On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> > > + * have more chance to invoke schedule() calls and provide necessary quiescent
> > > + * states. As a contrast, calling cond_resched() only won't achieve the same
> > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> > > + */
> >
> > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
> > Why does RT have more scheduling points?
>
> In RT, isn't BH-disabled code preemptible?  But yes, this would not help
> RCU Tasks.
>
By "more chance to invoke schedule()", my thought was that
cond_resched() becomes a no-op on RT or PREEMPT kernels. So it will not
call __schedule(SM_PREEMPT), which clears the NEED_RESCHED flag. On a
normal irq exit like a timer, when NEED_RESCHED is on,
schedule()/__schedule(0) can then be called from time to time.
__schedule(0) is good for RCU Tasks, __schedule(SM_PREEMPT) is not.

But I think this code comment does not take into account frequent
preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel.
When returning to these busy kthreads, irqentry_exit_cond_resched is
in fact called now, not schedule(). So likely __schedule(SM_PREEMPT) is
still called frequently, or even more frequently. So the code comment
looks incorrect regarding the RT argument. We should probably just
remove the "IS_ENABLED" condition. Paul and Sebastian, does this sound
reasonable to you?

Yan
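
Schematically, the two context-switch paths Yan is contrasting (a simplified
sketch: the SM_* values mirror kernel/sched/core.c, everything else is a
stand-in, not the real entry code):

#define SM_NONE		0x0	/* as in kernel/sched/core.c */
#define SM_PREEMPT	0x1

extern void __schedule(unsigned int sched_mode);	/* kernel-internal */
extern int need_resched(void);

static void voluntary_switch(void)	/* roughly: schedule() */
{
	__schedule(SM_NONE);		/* counts as a Tasks RCU QS */
}

static void preemptive_switch(void)	/* roughly: irqentry_exit_cond_resched() */
{
	if (need_resched())
		__schedule(SM_PREEMPT);	/* NOT a Tasks RCU QS */
}

/*
 * With CONFIG_PREEMPTION=y, cond_resched() is (nearly) a no-op, so a busy
 * kthread that never blocks only ever takes the preemptive path, which is
 * why rcu_softirq_qs_periodic() reports the quiescent state explicitly.
 */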
Paul E. McKenney March 23, 2024, 11:53 p.m. UTC | #7
On Fri, Mar 22, 2024 at 09:02:02PM -0500, Yan Zhai wrote:
> On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> > > > + * have more chance to invoke schedule() calls and provide necessary quiescent
> > > > + * states. As a contrast, calling cond_resched() only won't achieve the same
> > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> > > > + */
> > >
> > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
> > > Why does RT have more scheduling points?
> >
> > In RT, isn't BH-disabled code preemptible?  But yes, this would not help
> > RCU Tasks.
> >
> By "more chance to invoke schedule()", my thought was that
> cond_resched() becomes a no-op on RT or PREEMPT kernels. So it will not
> call __schedule(SM_PREEMPT), which clears the NEED_RESCHED flag. On a
> normal irq exit like a timer, when NEED_RESCHED is on,
> schedule()/__schedule(0) can then be called from time to time.
> __schedule(0) is good for RCU Tasks, __schedule(SM_PREEMPT) is not.
> 
> But I think this code comment does not take into account frequent
> preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel.
> When returning to these busy kthreads, irqentry_exit_cond_resched is
> in fact called now, not schedule(). So likely __schedule(SM_PREEMPT) is
> still called frequently, or even more frequently. So the code comment
> looks incorrect regarding the RT argument. We should probably just
> remove the "IS_ENABLED" condition. Paul and Sebastian, does this sound
> reasonable to you?

Removing the "IS_ENABLED(CONFIG_PREEMPT_RT)" condition makes a great deal
of sense to me, but I must defer to Sebastian for any RT implications.

							Thanx, Paul
Sebastian Andrzej Siewior April 5, 2024, 1:49 p.m. UTC | #8
On 2024-03-22 21:02:02 [-0500], Yan Zhai wrote:
> On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> > > > + * have more chance to invoke schedule() calls and provide necessary quiescent
> > > > + * states. As a contrast, calling cond_resched() only won't achieve the same
> > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> > > > + */
> > >
> > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
> > > Why does RT have more scheduling points?
> >
> > In RT, isn't BH-disabled code preemptible?  But yes, this would not help
> > RCU Tasks.
Yes, it is, but why does it matter? This is used in the NAPI thread
which is fully preemptible and does cond_resched(). This should be enough.

> By "more chance to invoke schedule()", my thought was that
> cond_resched() becomes a no-op on RT or PREEMPT kernels. So it will not
> call __schedule(SM_PREEMPT), which clears the NEED_RESCHED flag. On a
It will nop cond_resched(), correct. However, once something sets
NEED_RESCHED, the receiver of this flag will __schedule(SM_PREEMPT)
as soon as possible. That is either because the scheduler sends an IPI
and the CPU will do it in the irq-exit path, _or_ the thread does
preempt_enable() (which includes local_bh_enable()) and the counter hits
zero, at which point the same context switch happens.

Therefore I don't see a difference between CONFIG_PREEMPT and
CONFIG_PREEMPT_RT.

> normal irq exit like a timer, when NEED_RESCHED is on,
> schedule()/__schedule(0) can then be called from time to time.

This I cannot parse. __schedule(0) means the task gives up on its own
and goes to sleep. This does not happen for the NAPI-thread loop,
kworker loop or any other loop that consumes one work item after the
other and relies on cond_resched() in between.

> __schedule(0) is good for RCU Tasks, __schedule(SM_PREEMPT) is not.
Okay, and why is that? This means you expect that every thread gives up
on its own, which may take some time depending on the workload. This
should not matter.

If I see this right, the only difference is rcu_tasks_classic_qs(), and
I haven't yet figured out what it does.

> But I think this code comment does not take into account frequent
> preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel.
> When returning to these busy kthreads, irqentry_exit_cond_resched is
> in fact called now, not schedule(). So likely __schedule(SM_PREEMPT) is
> still called frequently, or even more frequently. So the code comment
> looks incorrect regarding the RT argument. We should probably just
> remove the "IS_ENABLED" condition. Paul and Sebastian, does this sound
> reasonable to you?

Can you walk me through it? Why is it so important for a task to give up
voluntarily? There is something wrong here with how RCU Tasks works.
We want to get rid of the sprinkled cond_resched(). This looks like
another variant of it that might be required in places with no
explanation except that it takes too long.

> Yan

Sebastian
Paul E. McKenney April 5, 2024, 6:13 p.m. UTC | #9
On Fri, Apr 05, 2024 at 03:49:46PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-03-22 21:02:02 [-0500], Yan Zhai wrote:
> > On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote:
> > > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> > > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> > > > > + * have more chance to invoke schedule() calls and provide necessary quiescent
> > > > > + * states. As a contrast, calling cond_resched() only won't achieve the same
> > > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> > > > > + */
> > > >
> > > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
> > > > Why does RT have more scheduling points?
> > >
> > > In RT, isn't BH-disabled code preemptible?  But yes, this would not help
> > > RCU Tasks.
> Yes, it is, but why does it matter? This is used in the NAPI thread
> which is fully preemptible and does cond_resched(). This should be enough.

By the time it gets to RCU, a cond_resched()-induced context switch looks
like a preemption.  This is fine for normal RCU, but Tasks RCU needs
a voluntary context switch for a quiescent state.  Which makes sense,
given that code running in a trampoline that is invoked from preemptible
code can itself be preempted.

So that additional call to rcu_softirq_qs() is needed.  (Which invokes
rcu_tasks_qs(), which in turn invokes rcu_tasks_classic_qs().)

And maybe it is also needed for RT.  The argument against would be that
RT applications have significant idle time on the one hand or spend lots
of time in nohz_full userspace on the other, both of which are quiescent
states for RCU Tasks.  But you tell me!

> > By "more chance to invoke schedule()", my thought was that
> > cond_resched() becomes a no-op on RT or PREEMPT kernels. So it will not
> > call __schedule(SM_PREEMPT), which clears the NEED_RESCHED flag. On a
> It will nop cond_resched(), correct. However, once something sets
> NEED_RESCHED, the receiver of this flag will __schedule(SM_PREEMPT)
> as soon as possible. That is either because the scheduler sends an IPI
> and the CPU will do it in the irq-exit path, _or_ the thread does
> preempt_enable() (which includes local_bh_enable()) and the counter hits
> zero, at which point the same context switch happens.
> 
> Therefore I don't see a difference between CONFIG_PREEMPT and
> CONFIG_PREEMPT_RT.

True, but again RCU Tasks needs a voluntary context switch and the
resulting preemption therefore does not qualify.

> > normal irq exit like a timer, when NEED_RESCHED is on,
> > schedule()/__schedule(0) can then be called from time to time.
> 
> This I cannot parse. __schedule(0) means the task gives up on its own
> and goes to sleep. This does not happen for the NAPI-thread loop,
> kworker loop or any other loop that consumes one work item after the
> other and relies on cond_resched() in between.
> 
> > __schedule(0) is good for RCU Tasks, __schedule(SM_PREEMPT) is not.
> Okay, and why is that? This means you expect that every thread gives up
> on its own, which may take some time depending on the workload. This
> should not matter.
> 
> If I see this right, the only difference is rcu_tasks_classic_qs(), and
> I haven't yet figured out what it does.

It marks the task as having passed through a Tasks RCU quiescent state.
Which works because this is called from task level (as opposed to from
irq, softirq, or NMI), so it cannot be returning to a trampoline that
is protected by Tasks RCU.

Later on, the Tasks RCU grace-period kthread will see the marking, and
remove this task from the list that is blocking the current Tasks-RCU
grace period.
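
A compressed model of that marking (a simplified sketch: the real logic is
the rcu_tasks_qs() macro in include/linux/rcupdate.h, and the names below
are stand-ins rather than kernel symbols):

#include <stdbool.h>

struct task_model {
	bool rcu_tasks_holdout;	/* still blocking the Tasks RCU grace period? */
};

extern void report_classic_rcu_qs(void);	/* stand-in for the normal-RCU side */

/* Called on a context switch; @preempt says whether it was involuntary. */
static void note_context_switch_model(struct task_model *t, bool preempt)
{
	report_classic_rcu_qs();	/* any context switch is a normal-RCU QS */

	/*
	 * Tasks RCU: only a voluntary switch counts.  A preempted task may
	 * have been running in a trampoline, so preemption proves nothing
	 * about when that trampoline can be freed.
	 */
	if (!preempt)
		t->rcu_tasks_holdout = false;
}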

> > But I think this code comment does not take into account frequent
> > preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel.
> > When returning to these busy kthreads, irqentry_exit_cond_resched is
> > in fact called now, not schedule(). So likely __schedule(SM_PREEMPT) is
> > still called frequently, or even more frequently. So the code comment
> > looks incorrect regarding the RT argument. We should probably just
> > remove the "IS_ENABLED" condition. Paul and Sebastian, does this sound
> > reasonable to you?
> 
> Can you walk me through it? Why is it so important for a task to give up
> voluntarily? There is something wrong here with how RCU Tasks works.
> We want to get rid of the sprinkled cond_resched(). This looks like
> another variant of it that might be required in places with no
> explanation except that it takes too long.

Hmmm...  I would normally point you at the Requirements.rst [1] document
for a walkthrough, but it does not cover all of this.

How about the upgrade shown below?

I agree that it would be nice if Tasks RCU did not have to depend on
voluntary context switches.  In theory, one alternative would be to
examine each task's stack, looking for return addresses in trampolines.
In practice, we recently took a look at this and other alternatives,
but none were feasible [2].  If you know of a better way, please do not
keep it a secret!

Note that kernel live patching has similar needs, and may need
similar annotations/innovations.

							Thanx, Paul

[1] Documentation/RCU/Design/Requirements/Requirements.rst
[2] https://docs.google.com/document/d/1kZY6AX-AHRIyYQsvUX6WJxS1LsDK4JA2CHuBnpkrR_U/edit?usp=sharing

------------------------------------------------------------------------
As written:
------------------------------------------------------------------------

Tasks RCU
~~~~~~~~~

Some forms of tracing use “trampolines” to handle the binary rewriting
required to install different types of probes. It would be good to be
able to free old trampolines, which sounds like a job for some form of
RCU. However, because it is necessary to be able to install a trace
anywhere in the code, it is not possible to use read-side markers such
as rcu_read_lock() and rcu_read_unlock(). In addition, it does
not work to have these markers in the trampoline itself, because there
would need to be instructions following rcu_read_unlock(). Although
synchronize_rcu() would guarantee that execution reached the
rcu_read_unlock(), it would not be able to guarantee that execution
had completely left the trampoline. Worse yet, in some situations
the trampoline's protection must extend a few instructions *prior* to
execution reaching the trampoline.  For example, these few instructions
might calculate the address of the trampoline, so that entering the
trampoline would be pre-ordained a surprisingly long time before execution
actually reached the trampoline itself.

The solution, in the form of `Tasks
RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
critical sections that are delimited by voluntary context switches, that
is, calls to schedule(), cond_resched(), and
synchronize_rcu_tasks(). In addition, transitions to and from
userspace execution also delimit tasks-RCU read-side critical sections.
Idle tasks are ignored by Tasks RCU, and Tasks Rude RCU may be used to
interact with them.

Note well that involuntary context switches are *not* Tasks-RCU quiescent
states.  After all, in preemptible kernels, a task executing code in a
trampoline might be preempted.  In this case, the Tasks-RCU grace period
clearly cannot end until that task resumes and its execution leaves that
trampoline.  This means, among other things, that cond_resched() does
not provide a Tasks RCU quiescent state.  (Instead, use rcu_softirq_qs()
from softirq or rcu_tasks_classic_qs() otherwise.)

The tasks-RCU API is quite compact, consisting only of
call_rcu_tasks(), synchronize_rcu_tasks(), and
rcu_barrier_tasks(). In ``CONFIG_PREEMPTION=n`` kernels, trampolines
cannot be preempted, so these APIs map to call_rcu(),
synchronize_rcu(), and rcu_barrier(), respectively. In
``CONFIG_PREEMPTION=y`` kernels, trampolines can be preempted, and these
three APIs are therefore implemented by separate functions that check
for voluntary context switches.
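
The freeing pattern this buys, in sketch form (illustrative: call_rcu_tasks()
is the real API, but struct tramp and the unpatch/free helpers are made up):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct tramp {
	void *text;			/* executable trampoline body */
	struct rcu_head rh;
};

extern void unpatch_all_callsites(struct tramp *tr);	/* hypothetical */
extern void free_tramp_text(void *text);		/* hypothetical */

static void tramp_free_cb(struct rcu_head *rhp)
{
	struct tramp *tr = container_of(rhp, struct tramp, rh);

	free_tramp_text(tr->text);
	kfree(tr);
}

static void tramp_retire(struct tramp *tr)
{
	unpatch_all_callsites(tr);	/* no new entries into tr->text after this */
	/*
	 * The callback runs only after every task has passed through a
	 * voluntary context switch (or userspace/idle), so no task can
	 * still be executing, or committed to executing, inside tr->text.
	 */
	call_rcu_tasks(&tr->rh, tramp_free_cb);
}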

------------------------------------------------------------------------
As patch:
------------------------------------------------------------------------

diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index cccafdaa1f849..f511476b45506 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -2357,6 +2357,7 @@ section.
 #. `Sched Flavor (Historical)`_
 #. `Sleepable RCU`_
 #. `Tasks RCU`_
+#. `Tasks Trace RCU`_
 
 Bottom-Half Flavor (Historical)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -2610,6 +2611,16 @@ critical sections that are delimited by voluntary context switches, that
 is, calls to schedule(), cond_resched(), and
 synchronize_rcu_tasks(). In addition, transitions to and from
 userspace execution also delimit tasks-RCU read-side critical sections.
+Idle tasks are ignored by Tasks RCU, and Tasks Rude RCU may be used to
+interact with them.
+
+Note well that involuntary context switches are *not* Tasks-RCU quiescent
+states.  After all, in preemptible kernels, a task executing code in a
+trampoline might be preempted.  In this case, the Tasks-RCU grace period
+clearly cannot end until that task resumes and its execution leaves that
+trampoline.  This means, among other things, that cond_resched() does
+not provide a Tasks RCU quiescent state.  (Instead, use rcu_softirq_qs()
+from softirq or rcu_tasks_classic_qs() otherwise.)
 
 The tasks-RCU API is quite compact, consisting only of
 call_rcu_tasks(), synchronize_rcu_tasks(), and
@@ -2632,6 +2643,11 @@ moniker.  And this operation is considered to be quite rude by real-time
 workloads that don't want their ``nohz_full`` CPUs receiving IPIs and
 by battery-powered systems that don't want their idle CPUs to be awakened.
 
+Once kernel entry/exit and deep-idle functions have been properly tagged
+``noinstr``, Tasks RCU can start paying attention to idle tasks (except
+those that are idle from RCU's perspective) and then Tasks Rude RCU can
+be removed from the kernel.
+
 The tasks-rude-RCU API is also reader-marking-free and thus quite compact,
 consisting of call_rcu_tasks_rude(), synchronize_rcu_tasks_rude(),
 and rcu_barrier_tasks_rude().

Patch

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 16f519914415..17d7ed5f3ae6 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -247,6 +247,37 @@  do { \
 	cond_resched(); \
 } while (0)
 
+/**
+ * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
+ * @old_ts: jiffies at start of processing.
+ *
+ * This helper is for long-running softirq handlers, such as NAPI threads in
+ * networking. The caller should initialize the variable passed in as @old_ts
+ * at the beginning of the softirq handler. When invoked frequently, this macro
+ * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will
+ * provide both RCU and RCU-Tasks quiescent states. Note that this macro
+ * modifies its old_ts argument.
+ *
+ * Because regions of code that have disabled softirq act as RCU read-side
+ * critical sections, this macro should be invoked with softirq (and
+ * preemption) enabled.
+ *
+ * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
+ * have more chance to invoke schedule() calls and provide necessary quiescent
+ * states. As a contrast, calling cond_resched() only won't achieve the same
+ * effect because cond_resched() does not provide RCU-Tasks quiescent states.
+ */
+#define rcu_softirq_qs_periodic(old_ts) \
+do { \
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
+	    time_after(jiffies, (old_ts) + HZ / 10)) { \
+		preempt_disable(); \
+		rcu_softirq_qs(); \
+		preempt_enable(); \
+		(old_ts) = jiffies; \
+	} \
+} while (0)
+
 /*
  * Infrastructure to implement the synchronize_() primitives in
  * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
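
One note on the interval arithmetic in the test above: HZ is the number of
jiffies per second, so "(old_ts) + HZ / 10" is 100 milliseconds after old_ts
at any configured tick rate, and time_after() is safe across jiffies
wraparound. For example:

/* HZ = 1000: HZ / 10 = 100 jiffies ~= 100 ms */
/* HZ =  250: HZ / 10 =  25 jiffies ~= 100 ms */
/* HZ =  100: HZ / 10 =  10 jiffies ~= 100 ms */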