| Field | Value |
|---|---|
| Message ID | 90431d46ee112d2b0af04dbfe936faaca11810a5.1710877680.git.yan@cloudflare.com (mailing list archive) |
| State | Accepted |
| Commit | 1a77557d48cff187a169c2aec01c0dd78a5e7e50 |
| Delegated to | Netdev Maintainers |
| Series | Report RCU QS for busy network kthreads |
On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote:
> When under heavy load, network processing can run CPU-bound for many
> tens of seconds. Even in preemptible kernels (non-RT kernel), this can
> block RCU Tasks grace periods, which can cause trace-event removal to
> take more than a minute, which is unacceptably long.
>
> This commit therefore creates a new helper function that passes through
> both RCU and RCU-Tasks quiescent states every 100 milliseconds. This
> hard-coded value suffices for current workloads.
>
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Yan Zhai <yan@cloudflare.com>

If you would like me to take this one via -rcu, I would be happy to take
it.  If it would be easier for you to push these as a group though
networking:

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
> v4->v5: adjusted kernel docs and commit message
> v3->v4: kernel docs error
>
> ---
>  include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 16f519914415..17d7ed5f3ae6 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -247,6 +247,37 @@ do { \
>  		cond_resched(); \
>  } while (0)
>
> +/**
> + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
> + * @old_ts: jiffies at start of processing.
> + *
> + * This helper is for long-running softirq handlers, such as NAPI threads in
> + * networking. The caller should initialize the variable passed in as @old_ts
> + * at the beginning of the softirq handler. When invoked frequently, this macro
> + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will
> + * provide both RCU and RCU-Tasks quiescent states. Note that this macro
> + * modifies its old_ts argument.
> + *
> + * Because regions of code that have disabled softirq act as RCU read-side
> + * critical sections, this macro should be invoked with softirq (and
> + * preemption) enabled.
> + *
> + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> + * have more chance to invoke schedule() calls and provide necessary quiescent
> + * states. As a contrast, calling cond_resched() only won't achieve the same
> + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> + */
> +#define rcu_softirq_qs_periodic(old_ts) \
> +do { \
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> +	    time_after(jiffies, (old_ts) + HZ / 10)) { \
> +		preempt_disable(); \
> +		rcu_softirq_qs(); \
> +		preempt_enable(); \
> +		(old_ts) = jiffies; \
> +	} \
> +} while (0)
> +
>  /*
>   * Infrastructure to implement the synchronize_() primitives in
>   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> --
> 2.30.2
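The macro's 100 ms check above hinges on the kernel's wrap-safe jiffies comparison, `time_after(jiffies, (old_ts) + HZ / 10)`. As a rough userspace sketch (a re-implementation of the `time_after()` idiom for illustration only; `period_elapsed()` is a made-up helper, not kernel code), the signed-difference trick keeps the comparison correct even when the tick counter wraps around:

```c
#include <assert.h>
#include <limits.h>

/*
 * Userspace re-implementation of the time_after() idiom: "a" is after
 * "b" iff the signed difference b - a is negative.  The unsigned
 * subtraction plus signed cast keeps this correct across counter
 * wraparound, unlike a plain "a > b" comparison.
 */
#define time_after(a, b) ((long)((b) - (a)) < 0)

/* Return nonzero once more than hz/10 ticks (~100 ms) have elapsed. */
static int period_elapsed(unsigned long now, unsigned long old_ts,
			  unsigned long hz)
{
	return time_after(now, old_ts + hz / 10);
}
```

A plain `now > old_ts + hz / 10` would misfire near the wrap point, which is exactly why the kernel macro uses `time_after()` rather than a direct comparison.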
Hi Paul,

On Tue, Mar 19, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote:
> > When under heavy load, network processing can run CPU-bound for many
> > tens of seconds. Even in preemptible kernels (non-RT kernel), this can
> > block RCU Tasks grace periods, which can cause trace-event removal to
> > take more than a minute, which is unacceptably long.
> >
> > [...]
>
> If you would like me to take this one via -rcu, I would be happy to take
> it. If it would be easier for you to push these as a group though
> networking:
>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
>
Since the whole series aims at fixing net problems, going through net
is probably more consistent. Also, thank you for your help through the
series!

Yan

[...]
On Tue, Mar 19, 2024 at 05:00:24PM -0500, Yan Zhai wrote:
> Hi Paul,
>
> On Tue, Mar 19, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote:
> > > [...]
> >
> > If you would like me to take this one via -rcu, I would be happy to take
> > it. If it would be easier for you to push these as a group though
> > networking:
> >
> > Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
>
> Since the whole series aims at fixing net problems, going through net
> is probably more consistent.

Very good!  I will let you push it along.

> Also, thank you for your help through the series!

No, thank you!  I had just been asked to find this slowdown when you
posted the patch.  So it worked out extremely well for me!  ;-)

							Thanx, Paul

> Yan

[...]
On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> index 16f519914415..17d7ed5f3ae6 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -247,6 +247,37 @@ do { \
>  		cond_resched(); \
>  } while (0)
>
> +/**
> + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
> + * @old_ts: jiffies at start of processing.
> + *
> + * This helper is for long-running softirq handlers, such as NAPI threads in
> + * networking. The caller should initialize the variable passed in as @old_ts
> + * at the beginning of the softirq handler. When invoked frequently, this macro
> + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will
> + * provide both RCU and RCU-Tasks quiescent states. Note that this macro
> + * modifies its old_ts argument.
> + *
> + * Because regions of code that have disabled softirq act as RCU read-side
> + * critical sections, this macro should be invoked with softirq (and
> + * preemption) enabled.
> + *
> + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> + * have more chance to invoke schedule() calls and provide necessary quiescent
> + * states. As a contrast, calling cond_resched() only won't achieve the same
> + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> + */

Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
Why does RT have more scheduling points?

The RCU-Tasks thread is starving and yet there is no wake-up, correct?
Shouldn't cond_resched() take care of RCU-Tasks's needs, too? This
function is used by napi_threaded_poll(), which is not invoked in
softirq; it is a simple thread which disables BH, but that is it. Any
pending softirqs are served before the cond_resched().

This napi_threaded_poll() case is _basically_ a busy thread doing a lot
of work and delaying RCU-Tasks, as far as I understand. The same may
happen to other busy workers which have cond_resched() between work
items, such as the kworker. Therefore I would expect to have some kind
of timeout at which point NEED_RESCHED is set so that cond_resched()
can do its work.

[...]

Sebastian
On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> > [...]
> > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> > + * have more chance to invoke schedule() calls and provide necessary quiescent
> > + * states. As a contrast, calling cond_resched() only won't achieve the same
> > + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> > + */
>
> Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
> Why does RT have more scheduling points?

In RT, isn't BH-disabled code preemptible?  But yes, this would not help
RCU Tasks.

> The RCU-Tasks thread is starving and yet there is no wake-up, correct?
> Shouldn't cond_resched() take care of RCU-Tasks's needs, too?
> This function is used by napi_threaded_poll() which is not invoked in
> softirq it is a simple thread which does disable BH but this is it. Any
> pending softirqs are served before the cond_resched().
>
> This napi_threaded_poll() case _basically_ a busy thread doing a lot of
> work and delaying RCU-Tasks as far as I understand. The same may happen
> to other busy-worker which have cond_resched() between works, such as
> the kworker. Therefore I would expect to have some kind of timeout at
> which point NEED_RESCHED is set so that cond_resched() can do its work.

A NEED_RESCHED with a cond_resched() would still be counted as a
preemption.  If we were intending to keep cond_resched(), I would be
thinking in terms of changing that, but only for Tasks RCU.

Given no cond_resched(), I would be thinking in terms of removing the
check for CONFIG_PREEMPT_RT.

Thoughts?

							Thanx, Paul

[...]
On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> > > + * have more chance to invoke schedule() calls and provide necessary quiescent
> > > + * states. As a contrast, calling cond_resched() only won't achieve the same
> > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> > > + */
> >
> > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
> > Why does RT have more scheduling points?
>
> In RT, isn't BH-disabled code preemptible?  But yes, this would not help
> RCU Tasks.
>
By "more chance to invoke schedule()", my thought was that
cond_resched() becomes a no-op on an RT or PREEMPT kernel, so it will
not call __schedule(SM_PREEMPT), which clears the NEED_RESCHED flag. On
a normal irq exit, like a timer, when NEED_RESCHED is on,
schedule()/__schedule(0) can then be called from time to time.
__schedule(0) is good for RCU Tasks; __schedule(SM_PREEMPT) is not.

But I think this code comment does not take into account frequent
preempt_schedule() and irqentry_exit_cond_resched() on a PREEMPT kernel.
When returning to these busy kthreads, irqentry_exit_cond_resched() is
in fact called now, not schedule(). So likely __schedule(SM_PREEMPT) is
still called frequently, or even more frequently. So the code comment
looks incorrect on the RT argument part. We probably should remove the
"IS_ENABLED" condition really. Paul and Sebastian, does this sound
reasonable to you?

Yan
On Fri, Mar 22, 2024 at 09:02:02PM -0500, Yan Zhai wrote:
> On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> > > > [...]
> > >
> > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
> > > Why does RT have more scheduling points?
> >
> > In RT, isn't BH-disabled code preemptible?  But yes, this would not help
> > RCU Tasks.
>
> By "more chance to invoke schedule()", my thought was that
> cond_resched becomes a no-op on RT or PREEMPT kernel. So it will not
> call __schedule(SM_PREEMPT), which clears the NEED_RESCHED flag. On a
> normal irq exit like timer, when NEED_RESCHED is on,
> schedule()/__schedule(0) can be called from time to time then.
> __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not.
>
> But I think this code comment does not take into account frequent
> preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel.
> When returning to these busy kthreads, irqentry_exit_cond_resched is
> in fact called now, not schedule(). So likely __schedule(SM_PREEMPT) is
> still called frequently, or even more frequently. So the code comment
> looks incorrect on the RT argument part. We probably should remove the
> "IS_ENABLED" condition really. Paul and Sebastian, does this sound
> reasonable to you?

Removing the "IS_ENABLED(CONFIG_PREEMPT_RT)" condition makes a great
deal of sense to me, but I must defer to Sebastian for any RT
implications.

							Thanx, Paul
On 2024-03-22 21:02:02 [-0500], Yan Zhai wrote:
> On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> > > > [...]
> > >
> > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
> > > Why does RT have more scheduling points?
> >
> > In RT, isn't BH-disabled code preemptible?  But yes, this would not help
> > RCU Tasks.

Yes, it is, but why does it matter? This is used in the NAPI thread,
which is fully preemptible and does cond_resched(). This should be
enough.

> By "more chance to invoke schedule()", my thought was that
> cond_resched becomes a no-op on RT or PREEMPT kernel. So it will not
> call __schedule(SM_PREEMPT), which clears the NEED_RESCHED flag. On a

It will nop cond_resched(), correct. However, once something sends
NEED_RESCHED, the receiver of this flag will __schedule(SM_PREEMPT)
as soon as possible. That is either because the scheduler sends an IPI
and the CPU will do it in the irq-exit path _or_ the thread does
preempt_enable() (which includes local_bh_enable()) and the counter
hits zero, at which point the same context switch happens.

Therefore I don't see a difference between CONFIG_PREEMPT and
CONFIG_PREEMPT_RT.

> normal irq exit like timer, when NEED_RESCHED is on,
> schedule()/__schedule(0) can be called from time to time then.

This I cannot parse. __schedule(0) means the task gives up on its own
and goes to sleep. This does not happen for the NAPI-thread loop, the
kworker loop, or any other loop that consumes one work item after the
other and relies on cond_resched() in between.

> __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not.

Okay, and why is that? This means you expect that every thread gives up
on its own, which may take some time depending on the workload. This
should not matter.

If I see this right, the only difference is rcu_tasks_classic_qs(), and
I haven't figured out yet what it does.

> But I think this code comment does not take into account frequent
> preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel.
> When returning to these busy kthreads, irqentry_exit_cond_resched is
> in fact called now, not schedule(). So likely __schedule(SM_PREEMPT) is
> still called frequently, or even more frequently. So the code comment
> looks incorrect on the RT argument part. We probably should remove the
> "IS_ENABLED" condition really. Paul and Sebastian, does this sound
> reasonable to you?

Can you walk me through it? Why is it so important for a task to give
up voluntarily? There is something wrong here with how RCU Tasks works.
We want to get rid of the sprinkled cond_resched(). This looks like
another variant of it that might be required in places, with no
explanation except that it takes too long.

> Yan

Sebastian
On Fri, Apr 05, 2024 at 03:49:46PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-03-22 21:02:02 [-0500], Yan Zhai wrote:
> > [...]
> > > In RT, isn't BH-disabled code preemptible?  But yes, this would not help
> > > RCU Tasks.
>
> Yes, it is, but why does it matter? This is used in the NAPI thread,
> which is fully preemptible and does cond_resched(). This should be
> enough.

By the time it gets to RCU, a cond_resched()-induced context switch
looks like a preemption.  This is fine for normal RCU, but Tasks RCU
needs a voluntary context switch for a quiescent state.  Which makes
sense, given that code running in a trampoline that is invoked from
preemptible code can itself be preempted.  So that additional call to
rcu_softirq_qs() is needed.  (Which invokes rcu_tasks_qs(), which in
turn invokes rcu_tasks_classic_qs().)

And maybe it is also needed for RT.  The argument against would be that
RT applications have significant idle time on the one hand or spend lots
of time in nohz_full userspace on the other, both of which are quiescent
states for RCU Tasks.  But you tell me!

> > By "more chance to invoke schedule()", my thought was that
> > cond_resched becomes a no-op on RT or PREEMPT kernel. So it will not
> > call __schedule(SM_PREEMPT), which clears the NEED_RESCHED flag. On a
>
> It will nop cond_resched(), correct. However, once something sends
> NEED_RESCHED, the receiver of this flag will __schedule(SM_PREEMPT)
> as soon as possible. That is either because the scheduler sends an IPI
> and the CPU will do it in the irq-exit path _or_ the thread does
> preempt_enable() (which includes local_bh_enable()) and the counter
> hits zero, at which point the same context switch happens.
>
> Therefore I don't see a difference between CONFIG_PREEMPT and
> CONFIG_PREEMPT_RT.

True, but again RCU Tasks needs a voluntary context switch and the
resulting preemption therefore does not qualify.

> > normal irq exit like timer, when NEED_RESCHED is on,
> > schedule()/__schedule(0) can be called from time to time then.
>
> This I cannot parse. __schedule(0) means the task gives up on its own
> and goes to sleep. This does not happen for the NAPI-thread loop, the
> kworker loop, or any other loop that consumes one work item after the
> other and relies on cond_resched() in between.
>
> > __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not.
>
> Okay, and why is that? This means you expect that every thread gives up
> on its own, which may take some time depending on the workload. This
> should not matter.
>
> If I see this right, the only difference is rcu_tasks_classic_qs(), and
> I haven't figured out yet what it does.

It marks the task as having passed through a Tasks RCU quiescent state.
Which works because this is called from task level (as opposed to from
irq, softirq, or NMI), so it cannot be returning to a trampoline that is
protected by Tasks RCU.  Later on, the Tasks RCU grace-period kthread
will see the marking, and remove this task from the list that is
blocking the current Tasks-RCU grace period.

> > But I think this code comment does not take into account frequent
> > preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel.
> > [...]
>
> Can you walk me through it? Why is it so important for a task to give
> up voluntarily? There is something wrong here with how RCU Tasks works.
> We want to get rid of the sprinkled cond_resched(). This looks like
> another variant of it that might be required in places, with no
> explanation except that it takes too long.

Hmmm...  I would normally point you at the Requirements.rst [1] document
for a walkthrough, but it does not cover all of this.  How about the
upgrade shown below?

I agree that it would be nice if Tasks RCU did not have to depend on
voluntary context switches.  In theory, one alternative would be to
examine each task's stack, looking for return addresses in trampolines.
In practice, we recently took a look at this and other alternatives,
but none were feasible [2].  If you know of a better way, please do not
keep it a secret!  Note that kernel live patching has similar needs,
and may need similar annotations/innovations.

							Thanx, Paul

[1] Documentation/RCU/Design/Requirements/Requirements.rst
[2] https://docs.google.com/document/d/1kZY6AX-AHRIyYQsvUX6WJxS1LsDK4JA2CHuBnpkrR_U/edit?usp=sharing

------------------------------------------------------------------------

As written:

------------------------------------------------------------------------

Tasks RCU
~~~~~~~~~

Some forms of tracing use "trampolines" to handle the binary rewriting
required to install different types of probes.  It would be good to be
able to free old trampolines, which sounds like a job for some form of
RCU.  However, because it is necessary to be able to install a trace
anywhere in the code, it is not possible to use read-side markers such
as rcu_read_lock() and rcu_read_unlock().  In addition, it does not work
to have these markers in the trampoline itself, because there would need
to be instructions following rcu_read_unlock().  Although
synchronize_rcu() would guarantee that execution reached the
rcu_read_unlock(), it would not be able to guarantee that execution had
completely left the trampoline.  Worse yet, in some situations the
trampoline's protection must extend a few instructions *prior* to
execution reaching the trampoline.  For example, these few instructions
might calculate the address of the trampoline, so that entering the
trampoline would be pre-ordained a surprisingly long time before
execution actually reached the trampoline itself.

The solution, in the form of `Tasks RCU <https://lwn.net/Articles/607117/>`__,
is to have implicit read-side critical sections that are delimited by
voluntary context switches, that is, calls to schedule(),
cond_resched(), and synchronize_rcu_tasks().  In addition, transitions
to and from userspace execution also delimit tasks-RCU read-side
critical sections.  Idle tasks are ignored by Tasks RCU, and Tasks Rude
RCU may be used to interact with them.

Note well that involuntary context switches are *not* Tasks-RCU
quiescent states.  After all, in preemptible kernels, a task executing
code in a trampoline might be preempted.  In this case, the Tasks-RCU
grace period clearly cannot end until that task resumes and its
execution leaves that trampoline.  This means, among other things, that
cond_resched() does not provide a Tasks RCU quiescent state.  (Instead,
use rcu_softirq_qs() from softirq or rcu_tasks_classic_qs() otherwise.)

The tasks-RCU API is quite compact, consisting only of
call_rcu_tasks(), synchronize_rcu_tasks(), and rcu_barrier_tasks().  In
``CONFIG_PREEMPTION=n`` kernels, trampolines cannot be preempted, so
these APIs map to call_rcu(), synchronize_rcu(), and rcu_barrier(),
respectively.  In ``CONFIG_PREEMPTION=y`` kernels, trampolines can be
preempted, and these three APIs are therefore implemented by separate
functions that check for voluntary context switches.

------------------------------------------------------------------------

As patch:

------------------------------------------------------------------------

diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index cccafdaa1f849..f511476b45506 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -2357,6 +2357,7 @@ section.
 #. `Sched Flavor (Historical)`_
 #. `Sleepable RCU`_
 #. `Tasks RCU`_
+#. `Tasks Trace RCU`_
 
 Bottom-Half Flavor (Historical)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -2610,6 +2611,16 @@ critical sections that are delimited by voluntary context switches, that
 is, calls to schedule(), cond_resched(), and synchronize_rcu_tasks().
 In addition, transitions to and from userspace execution also delimit
 tasks-RCU read-side critical sections.
+Idle tasks are ignored by Tasks RCU, and Tasks Rude RCU may be used to
+interact with them.
+
+Note well that involuntary context switches are *not* Tasks-RCU quiescent
+states.  After all, in preemptible kernels, a task executing code in a
+trampoline might be preempted.  In this case, the Tasks-RCU grace period
+clearly cannot end until that task resumes and its execution leaves that
+trampoline.  This means, among other things, that cond_resched() does
+not provide a Tasks RCU quiescent state.  (Instead, use rcu_softirq_qs()
+from softirq or rcu_tasks_classic_qs() otherwise.)
 
 The tasks-RCU API is quite compact, consisting only of call_rcu_tasks(),
 synchronize_rcu_tasks(), and
@@ -2632,6 +2643,11 @@ moniker.  And this operation is considered to be quite rude by real-time
 workloads that don't want their ``nohz_full`` CPUs receiving IPIs and
 by battery-powered systems that don't want their idle CPUs to be awakened.
 
+Once kernel entry/exit and deep-idle functions have been properly tagged
+``noinstr``, Tasks RCU can start paying attention to idle tasks (except
+those that are idle from RCU's perspective) and then Tasks Rude RCU can
+be removed from the kernel.
+
 The tasks-rude-RCU API is also reader-marking-free and thus quite
 compact, consisting of call_rcu_tasks_rude(),
 synchronize_rcu_tasks_rude(), and rcu_barrier_tasks_rude().
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 16f519914415..17d7ed5f3ae6 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -247,6 +247,37 @@ do { \
 		cond_resched(); \
 } while (0)
 
+/**
+ * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
+ * @old_ts: jiffies at start of processing.
+ *
+ * This helper is for long-running softirq handlers, such as NAPI threads in
+ * networking. The caller should initialize the variable passed in as @old_ts
+ * at the beginning of the softirq handler. When invoked frequently, this macro
+ * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will
+ * provide both RCU and RCU-Tasks quiescent states. Note that this macro
+ * modifies its old_ts argument.
+ *
+ * Because regions of code that have disabled softirq act as RCU read-side
+ * critical sections, this macro should be invoked with softirq (and
+ * preemption) enabled.
+ *
+ * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
+ * have more chance to invoke schedule() calls and provide necessary quiescent
+ * states. As a contrast, calling cond_resched() only won't achieve the same
+ * effect because cond_resched() does not provide RCU-Tasks quiescent states.
+ */
+#define rcu_softirq_qs_periodic(old_ts) \
+do { \
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
+	    time_after(jiffies, (old_ts) + HZ / 10)) { \
+		preempt_disable(); \
+		rcu_softirq_qs(); \
+		preempt_enable(); \
+		(old_ts) = jiffies; \
+	} \
+} while (0)
+
 /*
  * Infrastructure to implement the synchronize_() primitives in
  * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
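To see the cadence this macro produces, here is a rough userspace simulation of its timing logic. This is a sketch, not kernel code: jiffies, HZ, and rcu_softirq_qs() are stand-ins, the preempt_disable()/preempt_enable() pair and the CONFIG_PREEMPT_RT check are omitted, and run_busy_loop() is a made-up harness standing in for a NAPI-style polling loop.

```c
#include <assert.h>

/* Userspace stand-ins for kernel symbols -- illustration only. */
#define HZ 1000UL                      /* pretend one tick == 1 ms */
static unsigned long jiffies;          /* simulated tick counter */
static int qs_reports;                 /* counts rcu_softirq_qs() calls */

static void rcu_softirq_qs(void) { qs_reports++; }

/* Same shape as the patch's macro, minus the preemption plumbing;
 * the comparison is the wrap-safe time_after() idiom written out. */
#define rcu_softirq_qs_periodic(old_ts)				\
do {								\
	if ((long)(((old_ts) + HZ / 10) - jiffies) < 0) {	\
		rcu_softirq_qs();				\
		(old_ts) = jiffies;				\
	}							\
} while (0)

/*
 * Simulate a busy polling loop: each iteration represents one tick
 * (~1 ms) of CPU-bound work, with the periodic-QS check invoked every
 * iteration.  Returns how many quiescent states were reported.
 */
static int run_busy_loop(unsigned long ticks)
{
	unsigned long old_ts;

	jiffies = 0;
	qs_reports = 0;
	old_ts = jiffies;              /* caller initializes old_ts */
	for (; jiffies < ticks; jiffies++)
		rcu_softirq_qs_periodic(old_ts);
	return qs_reports;
}
```

In this simulation, a loop that burns 1000 ticks (one simulated second) reports a quiescent state at ticks 101, 202, ..., 909 — the first tick strictly after each 100 ms window — so the busy thread keeps feeding RCU and Tasks RCU quiescent states instead of blocking grace periods indefinitely, at a cost of one extra check per poll iteration.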