diff mbox series

[RFC,48/86] rcu: handle quiescent states for PREEMPT_RCU=n

Message ID 20231107215742.363031-49-ankur.a.arora@oracle.com (mailing list archive)
State New
Headers show
Series Make the kernel preemptible | expand

Commit Message

Ankur Arora Nov. 7, 2023, 9:57 p.m. UTC
cond_resched() is used to provide urgent quiescent states for
read-side critical sections on PREEMPT_RCU=n configurations.
This was necessary because lacking preempt_count, there was no
way for the tick handler to know if we were executing in RCU
read-side critical section or not.

An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
reliably report quiescent states.

Accordingly, evaluate preempt_count() based quiescence in
rcu_flavor_sched_clock_irq().

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/rcu/tree_plugin.h |  3 ++-
 kernel/sched/core.c      | 15 +--------------
 2 files changed, 3 insertions(+), 15 deletions(-)

Comments

Paul E. McKenney Nov. 21, 2023, 12:38 a.m. UTC | #1
On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> cond_resched() is used to provide urgent quiescent states for
> read-side critical sections on PREEMPT_RCU=n configurations.
> This was necessary because lacking preempt_count, there was no
> way for the tick handler to know if we were executing in RCU
> read-side critical section or not.
> 
> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> reliably report quiescent states.
> 
> Accordingly, evaluate preempt_count() based quiescence in
> rcu_flavor_sched_clock_irq().
> 
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/rcu/tree_plugin.h |  3 ++-
>  kernel/sched/core.c      | 15 +--------------
>  2 files changed, 3 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index f87191e008ff..618f055f8028 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
>   */
>  static void rcu_flavor_sched_clock_irq(int user)
>  {
> -	if (user || rcu_is_cpu_rrupt_from_idle()) {
> +	if (user || rcu_is_cpu_rrupt_from_idle() ||
> +	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {

This looks good.

>  		/*
>  		 * Get here if this CPU took its interrupt from user
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bf5df2b866df..15db5fb7acc7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
>  		preempt_schedule_common();
>  		return 1;
>  	}
> -	/*
> -	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> -	 * whether the current CPU is in an RCU read-side critical section,
> -	 * so the tick can report quiescent states even for CPUs looping
> -	 * in kernel context.  In contrast, in non-preemptible kernels,
> -	 * RCU readers leave no in-memory hints, which means that CPU-bound
> -	 * processes executing in kernel context might never report an
> -	 * RCU quiescent state.  Therefore, the following code causes
> -	 * cond_resched() to report a quiescent state, but only when RCU
> -	 * is in urgent need of one.
> -	 */
> -#ifndef CONFIG_PREEMPT_RCU
> -	rcu_all_qs();
> -#endif

But...

Suppose we have a long-running loop in the kernel that regularly
enables preemption, but only momentarily.  Then the added
rcu_flavor_sched_clock_irq() check would almost always fail, making
for extremely long grace periods.  Or did I miss a change that causes
preempt_enable() to help RCU out?

							Thanx, Paul

>  	return 0;
>  }
>  EXPORT_SYMBOL(_cond_resched);
> -- 
> 2.31.1
>
Ankur Arora Nov. 21, 2023, 3:26 a.m. UTC | #2
Paul E. McKenney <paulmck@kernel.org> writes:

> On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
>> cond_resched() is used to provide urgent quiescent states for
>> read-side critical sections on PREEMPT_RCU=n configurations.
>> This was necessary because lacking preempt_count, there was no
>> way for the tick handler to know if we were executing in RCU
>> read-side critical section or not.
>>
>> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
>> reliably report quiescent states.
>>
>> Accordingly, evaluate preempt_count() based quiescence in
>> rcu_flavor_sched_clock_irq().
>>
>> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/rcu/tree_plugin.h |  3 ++-
>>  kernel/sched/core.c      | 15 +--------------
>>  2 files changed, 3 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> index f87191e008ff..618f055f8028 100644
>> --- a/kernel/rcu/tree_plugin.h
>> +++ b/kernel/rcu/tree_plugin.h
>> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
>>   */
>>  static void rcu_flavor_sched_clock_irq(int user)
>>  {
>> -	if (user || rcu_is_cpu_rrupt_from_idle()) {
>> +	if (user || rcu_is_cpu_rrupt_from_idle() ||
>> +	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
>
> This looks good.
>
>>  		/*
>>  		 * Get here if this CPU took its interrupt from user
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index bf5df2b866df..15db5fb7acc7 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
>>  		preempt_schedule_common();
>>  		return 1;
>>  	}
>> -	/*
>> -	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
>> -	 * whether the current CPU is in an RCU read-side critical section,
>> -	 * so the tick can report quiescent states even for CPUs looping
>> -	 * in kernel context.  In contrast, in non-preemptible kernels,
>> -	 * RCU readers leave no in-memory hints, which means that CPU-bound
>> -	 * processes executing in kernel context might never report an
>> -	 * RCU quiescent state.  Therefore, the following code causes
>> -	 * cond_resched() to report a quiescent state, but only when RCU
>> -	 * is in urgent need of one.
>> -	 *      /
>> -#ifndef CONFIG_PREEMPT_RCU
>> -	rcu_all_qs();
>> -#endif
>
> But...
>
> Suppose we have a long-running loop in the kernel that regularly
> enables preemption, but only momentarily.  Then the added
> rcu_flavor_sched_clock_irq() check would almost always fail, making
> for extremely long grace periods.

So, my thinking was that if RCU wants to end a grace period, it would
force a context switch by setting TIF_NEED_RESCHED (and as patch 38 mentions
RCU always uses the the eager version) causing __schedule() to call
rcu_note_context_switch().
That's similar to the preempt_schedule_common() case in the
_cond_resched() above.

But if I see your point, RCU might just want to register a quiescent
state and for this long-running loop rcu_flavor_sched_clock_irq() does
seem to fall down.

> Or did I miss a change that causes preempt_enable() to help RCU out?

Something like this?

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index dc5125b9c36b..e50f358f1548 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -222,6 +222,8 @@ do { \
        barrier(); \
        if (unlikely(preempt_count_dec_and_test())) \
                __preempt_schedule(); \
+       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
+               rcu_all_qs(); \
 } while (0)

Though I do wonder about the likelihood of hitting the case you describe
and maybe instead of adding the check on every preempt_enable()
it might be better to instead force a context switch in the
rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case.)

Thanks

--
ankur
Zqiang Nov. 21, 2023, 3:55 a.m. UTC | #3
>
> cond_resched() is used to provide urgent quiescent states for
> read-side critical sections on PREEMPT_RCU=n configurations.
> This was necessary because lacking preempt_count, there was no
> way for the tick handler to know if we were executing in RCU
> read-side critical section or not.
>
> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> reliably report quiescent states.
>
> Accordingly, evaluate preempt_count() based quiescence in
> rcu_flavor_sched_clock_irq().
>
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/rcu/tree_plugin.h |  3 ++-
>  kernel/sched/core.c      | 15 +--------------
>  2 files changed, 3 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index f87191e008ff..618f055f8028 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
>   */
>  static void rcu_flavor_sched_clock_irq(int user)
>  {
> -       if (user || rcu_is_cpu_rrupt_from_idle()) {
> +       if (user || rcu_is_cpu_rrupt_from_idle() ||
> +           !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {


should ensure CONFIG_PREEMPT_COUNT=y
((IS_ENABLED(CONFIG_PREEMPT_COUNT) && !(preempt_count() &
(PREEMPT_MASK | SOFTIRQ_MASK)))

Thanks
Zqiang


>
>                 /*
>                  * Get here if this CPU took its interrupt from user
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bf5df2b866df..15db5fb7acc7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
>                 preempt_schedule_common();
>                 return 1;
>         }
> -       /*
> -        * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> -        * whether the current CPU is in an RCU read-side critical section,
> -        * so the tick can report quiescent states even for CPUs looping
> -        * in kernel context.  In contrast, in non-preemptible kernels,
> -        * RCU readers leave no in-memory hints, which means that CPU-bound
> -        * processes executing in kernel context might never report an
> -        * RCU quiescent state.  Therefore, the following code causes
> -        * cond_resched() to report a quiescent state, but only when RCU
> -        * is in urgent need of one.
> -        */
> -#ifndef CONFIG_PREEMPT_RCU
> -       rcu_all_qs();
> -#endif
> +
>         return 0;
>  }
>  EXPORT_SYMBOL(_cond_resched);
> --
> 2.31.1
>
Paul E. McKenney Nov. 21, 2023, 5:17 a.m. UTC | #4
On Mon, Nov 20, 2023 at 07:26:05PM -0800, Ankur Arora wrote:
> 
> Paul E. McKenney <paulmck@kernel.org> writes:
> > On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> >> cond_resched() is used to provide urgent quiescent states for
> >> read-side critical sections on PREEMPT_RCU=n configurations.
> >> This was necessary because lacking preempt_count, there was no
> >> way for the tick handler to know if we were executing in RCU
> >> read-side critical section or not.
> >>
> >> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> >> reliably report quiescent states.
> >>
> >> Accordingly, evaluate preempt_count() based quiescence in
> >> rcu_flavor_sched_clock_irq().
> >>
> >> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> ---
> >>  kernel/rcu/tree_plugin.h |  3 ++-
> >>  kernel/sched/core.c      | 15 +--------------
> >>  2 files changed, 3 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> >> index f87191e008ff..618f055f8028 100644
> >> --- a/kernel/rcu/tree_plugin.h
> >> +++ b/kernel/rcu/tree_plugin.h
> >> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> >>   */
> >>  static void rcu_flavor_sched_clock_irq(int user)
> >>  {
> >> -	if (user || rcu_is_cpu_rrupt_from_idle()) {
> >> +	if (user || rcu_is_cpu_rrupt_from_idle() ||
> >> +	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> >
> > This looks good.
> >
> >>  		/*
> >>  		 * Get here if this CPU took its interrupt from user
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index bf5df2b866df..15db5fb7acc7 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
> >>  		preempt_schedule_common();
> >>  		return 1;
> >>  	}
> >> -	/*
> >> -	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> >> -	 * whether the current CPU is in an RCU read-side critical section,
> >> -	 * so the tick can report quiescent states even for CPUs looping
> >> -	 * in kernel context.  In contrast, in non-preemptible kernels,
> >> -	 * RCU readers leave no in-memory hints, which means that CPU-bound
> >> -	 * processes executing in kernel context might never report an
> >> -	 * RCU quiescent state.  Therefore, the following code causes
> >> -	 * cond_resched() to report a quiescent state, but only when RCU
> >> -	 * is in urgent need of one.
> >> -	 *      /
> >> -#ifndef CONFIG_PREEMPT_RCU
> >> -	rcu_all_qs();
> >> -#endif
> >
> > But...
> >
> > Suppose we have a long-running loop in the kernel that regularly
> > enables preemption, but only momentarily.  Then the added
> > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > for extremely long grace periods.
> 
> So, my thinking was that if RCU wants to end a grace period, it would
> force a context switch by setting TIF_NEED_RESCHED (and as patch 38 mentions
> RCU always uses the the eager version) causing __schedule() to call
> rcu_note_context_switch().
> That's similar to the preempt_schedule_common() case in the
> _cond_resched() above.

But that requires IPIing that CPU, correct?

> But if I see your point, RCU might just want to register a quiescent
> state and for this long-running loop rcu_flavor_sched_clock_irq() does
> seem to fall down.
> 
> > Or did I miss a change that causes preempt_enable() to help RCU out?
> 
> Something like this?
> 
> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> index dc5125b9c36b..e50f358f1548 100644
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -222,6 +222,8 @@ do { \
>         barrier(); \
>         if (unlikely(preempt_count_dec_and_test())) \
>                 __preempt_schedule(); \
> +       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> +               rcu_all_qs(); \
>  } while (0)

Or maybe something like this to lighten the load a bit:

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) { \
		__preempt_schedule(); \
		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
		    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
			rcu_all_qs(); \
	} \
} while (0)

And at that point, we should be able to drop the PREEMPT_MASK, not
that it makes any difference that I am aware of:

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) { \
		__preempt_schedule(); \
		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
		    !(preempt_count() & SOFTIRQ_MASK)) \
			rcu_all_qs(); \
	} \
} while (0)

Except that we can migrate as soon as that preempt_count_dec_and_test()
returns.  And that rcu_all_qs() disables and re-enables preemption,
which will result in undesired recursion.  Sigh.

So maybe something like this:

#define preempt_enable() \
do { \
	if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
	    !(preempt_count() & SOFTIRQ_MASK)) \
		rcu_all_qs(); \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) { \
		__preempt_schedule(); \
	} \
} while (0)

Then rcu_all_qs() becomes something like this:

void rcu_all_qs(void)
{
	unsigned long flags;

	/* Load rcu_urgent_qs before other flags. */
	if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
		return;
	this_cpu_write(rcu_data.rcu_urgent_qs, false);
	if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
		local_irq_save(flags);
		rcu_momentary_dyntick_idle();
		local_irq_restore(flags);
	}
	rcu_qs();
}
EXPORT_SYMBOL_GPL(rcu_all_qs);

> Though I do wonder about the likelihood of hitting the case you describe
> and maybe instead of adding the check on every preempt_enable()
> it might be better to instead force a context switch in the
> rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case.)

Maybe.  But rcu_all_qs() is way lighter weight than a context switch.

							Thanx, Paul
Paul E. McKenney Nov. 21, 2023, 5:34 a.m. UTC | #5
On Mon, Nov 20, 2023 at 09:17:57PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 20, 2023 at 07:26:05PM -0800, Ankur Arora wrote:
> > 
> > Paul E. McKenney <paulmck@kernel.org> writes:
> > > On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> > >> cond_resched() is used to provide urgent quiescent states for
> > >> read-side critical sections on PREEMPT_RCU=n configurations.
> > >> This was necessary because lacking preempt_count, there was no
> > >> way for the tick handler to know if we were executing in RCU
> > >> read-side critical section or not.
> > >>
> > >> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> > >> reliably report quiescent states.
> > >>
> > >> Accordingly, evaluate preempt_count() based quiescence in
> > >> rcu_flavor_sched_clock_irq().
> > >>
> > >> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > >> ---
> > >>  kernel/rcu/tree_plugin.h |  3 ++-
> > >>  kernel/sched/core.c      | 15 +--------------
> > >>  2 files changed, 3 insertions(+), 15 deletions(-)
> > >>
> > >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > >> index f87191e008ff..618f055f8028 100644
> > >> --- a/kernel/rcu/tree_plugin.h
> > >> +++ b/kernel/rcu/tree_plugin.h
> > >> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> > >>   */
> > >>  static void rcu_flavor_sched_clock_irq(int user)
> > >>  {
> > >> -	if (user || rcu_is_cpu_rrupt_from_idle()) {
> > >> +	if (user || rcu_is_cpu_rrupt_from_idle() ||
> > >> +	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> > >
> > > This looks good.
> > >
> > >>  		/*
> > >>  		 * Get here if this CPU took its interrupt from user
> > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > >> index bf5df2b866df..15db5fb7acc7 100644
> > >> --- a/kernel/sched/core.c
> > >> +++ b/kernel/sched/core.c
> > >> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
> > >>  		preempt_schedule_common();
> > >>  		return 1;
> > >>  	}
> > >> -	/*
> > >> -	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> > >> -	 * whether the current CPU is in an RCU read-side critical section,
> > >> -	 * so the tick can report quiescent states even for CPUs looping
> > >> -	 * in kernel context.  In contrast, in non-preemptible kernels,
> > >> -	 * RCU readers leave no in-memory hints, which means that CPU-bound
> > >> -	 * processes executing in kernel context might never report an
> > >> -	 * RCU quiescent state.  Therefore, the following code causes
> > >> -	 * cond_resched() to report a quiescent state, but only when RCU
> > >> -	 * is in urgent need of one.
> > >> -	 *      /
> > >> -#ifndef CONFIG_PREEMPT_RCU
> > >> -	rcu_all_qs();
> > >> -#endif
> > >
> > > But...
> > >
> > > Suppose we have a long-running loop in the kernel that regularly
> > > enables preemption, but only momentarily.  Then the added
> > > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > > for extremely long grace periods.
> > 
> > So, my thinking was that if RCU wants to end a grace period, it would
> > force a context switch by setting TIF_NEED_RESCHED (and as patch 38 mentions
> > RCU always uses the the eager version) causing __schedule() to call
> > rcu_note_context_switch().
> > That's similar to the preempt_schedule_common() case in the
> > _cond_resched() above.
> 
> But that requires IPIing that CPU, correct?
> 
> > But if I see your point, RCU might just want to register a quiescent
> > state and for this long-running loop rcu_flavor_sched_clock_irq() does
> > seem to fall down.
> > 
> > > Or did I miss a change that causes preempt_enable() to help RCU out?
> > 
> > Something like this?
> > 
> > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > index dc5125b9c36b..e50f358f1548 100644
> > --- a/include/linux/preempt.h
> > +++ b/include/linux/preempt.h
> > @@ -222,6 +222,8 @@ do { \
> >         barrier(); \
> >         if (unlikely(preempt_count_dec_and_test())) \
> >                 __preempt_schedule(); \
> > +       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > +               rcu_all_qs(); \
> >  } while (0)
> 
> Or maybe something like this to lighten the load a bit:
> 
> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) { \
> 		__preempt_schedule(); \
> 		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> 		    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> 			rcu_all_qs(); \
> 	} \
> } while (0)
> 
> And at that point, we should be able to drop the PREEMPT_MASK, not
> that it makes any difference that I am aware of:
> 
> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) { \
> 		__preempt_schedule(); \
> 		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> 		    !(preempt_count() & SOFTIRQ_MASK)) \
> 			rcu_all_qs(); \
> 	} \
> } while (0)
> 
> Except that we can migrate as soon as that preempt_count_dec_and_test()
> returns.  And that rcu_all_qs() disables and re-enables preemption,
> which will result in undesired recursion.  Sigh.
> 
> So maybe something like this:
> 
> #define preempt_enable() \
> do { \
> 	if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> 	    !(preempt_count() & SOFTIRQ_MASK)) \

Sigh.  This needs to include (PREEMPT_MASK | SOFTIRQ_MASK),
but check for equality to something like (1UL << PREEMPT_SHIFT).

Clearly time to sleep.  :-/

							Thanx, Paul

> 		rcu_all_qs(); \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) { \
> 		__preempt_schedule(); \
> 	} \
> } while (0)
> 
> Then rcu_all_qs() becomes something like this:
> 
> void rcu_all_qs(void)
> {
> 	unsigned long flags;
> 
> 	/* Load rcu_urgent_qs before other flags. */
> 	if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
> 		return;
> 	this_cpu_write(rcu_data.rcu_urgent_qs, false);
> 	if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
> 		local_irq_save(flags);
> 		rcu_momentary_dyntick_idle();
> 		local_irq_restore(flags);
> 	}
> 	rcu_qs();
> }
> EXPORT_SYMBOL_GPL(rcu_all_qs);
> 
> > Though I do wonder about the likelihood of hitting the case you describe
> > and maybe instead of adding the check on every preempt_enable()
> > it might be better to instead force a context switch in the
> > rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case.)
> 
> Maybe.  But rcu_all_qs() is way lighter weight than a context switch.
> 
> 							Thanx, Paul
Zqiang Nov. 21, 2023, 6:13 a.m. UTC | #6
>
> On Mon, Nov 20, 2023 at 09:17:57PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 20, 2023 at 07:26:05PM -0800, Ankur Arora wrote:
> > >
> > > Paul E. McKenney <paulmck@kernel.org> writes:
> > > > On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> > > >> cond_resched() is used to provide urgent quiescent states for
> > > >> read-side critical sections on PREEMPT_RCU=n configurations.
> > > >> This was necessary because lacking preempt_count, there was no
> > > >> way for the tick handler to know if we were executing in RCU
> > > >> read-side critical section or not.
> > > >>
> > > >> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> > > >> reliably report quiescent states.
> > > >>
> > > >> Accordingly, evaluate preempt_count() based quiescence in
> > > >> rcu_flavor_sched_clock_irq().
> > > >>
> > > >> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > > >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > > >> ---
> > > >>  kernel/rcu/tree_plugin.h |  3 ++-
> > > >>  kernel/sched/core.c      | 15 +--------------
> > > >>  2 files changed, 3 insertions(+), 15 deletions(-)
> > > >>
> > > >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > >> index f87191e008ff..618f055f8028 100644
> > > >> --- a/kernel/rcu/tree_plugin.h
> > > >> +++ b/kernel/rcu/tree_plugin.h
> > > >> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> > > >>   */
> > > >>  static void rcu_flavor_sched_clock_irq(int user)
> > > >>  {
> > > >> -        if (user || rcu_is_cpu_rrupt_from_idle()) {
> > > >> +        if (user || rcu_is_cpu_rrupt_from_idle() ||
> > > >> +            !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> > > >
> > > > This looks good.
> > > >
> > > >>                  /*
> > > >>                   * Get here if this CPU took its interrupt from user
> > > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > >> index bf5df2b866df..15db5fb7acc7 100644
> > > >> --- a/kernel/sched/core.c
> > > >> +++ b/kernel/sched/core.c
> > > >> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
> > > >>                  preempt_schedule_common();
> > > >>                  return 1;
> > > >>          }
> > > >> -        /*
> > > >> -         * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> > > >> -         * whether the current CPU is in an RCU read-side critical section,
> > > >> -         * so the tick can report quiescent states even for CPUs looping
> > > >> -         * in kernel context.  In contrast, in non-preemptible kernels,
> > > >> -         * RCU readers leave no in-memory hints, which means that CPU-bound
> > > >> -         * processes executing in kernel context might never report an
> > > >> -         * RCU quiescent state.  Therefore, the following code causes
> > > >> -         * cond_resched() to report a quiescent state, but only when RCU
> > > >> -         * is in urgent need of one.
> > > >> -         *      /
> > > >> -#ifndef CONFIG_PREEMPT_RCU
> > > >> -        rcu_all_qs();
> > > >> -#endif
> > > >
> > > > But...
> > > >
> > > > Suppose we have a long-running loop in the kernel that regularly
> > > > enables preemption, but only momentarily.  Then the added
> > > > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > > > for extremely long grace periods.
> > >
> > > So, my thinking was that if RCU wants to end a grace period, it would
> > > force a context switch by setting TIF_NEED_RESCHED (and as patch 38 mentions
> > > RCU always uses the the eager version) causing __schedule() to call
> > > rcu_note_context_switch().
> > > That's similar to the preempt_schedule_common() case in the
> > > _cond_resched() above.
> >
> > But that requires IPIing that CPU, correct?
> >
> > > But if I see your point, RCU might just want to register a quiescent
> > > state and for this long-running loop rcu_flavor_sched_clock_irq() does
> > > seem to fall down.
> > >
> > > > Or did I miss a change that causes preempt_enable() to help RCU out?
> > >
> > > Something like this?
> > >
> > > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > > index dc5125b9c36b..e50f358f1548 100644
> > > --- a/include/linux/preempt.h
> > > +++ b/include/linux/preempt.h
> > > @@ -222,6 +222,8 @@ do { \
> > >         barrier(); \
> > >         if (unlikely(preempt_count_dec_and_test())) \
> > >                 __preempt_schedule(); \
> > > +       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > > +               rcu_all_qs(); \
> > >  } while (0)
> >
> > Or maybe something like this to lighten the load a bit:
> >
> > #define preempt_enable() \
> > do { \
> >       barrier(); \
> >       if (unlikely(preempt_count_dec_and_test())) { \
> >               __preempt_schedule(); \
> >               if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> >                   !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> >                       rcu_all_qs(); \
> >       } \
> > } while (0)
> >
> > And at that point, we should be able to drop the PREEMPT_MASK, not
> > that it makes any difference that I am aware of:
> >
> > #define preempt_enable() \
> > do { \
> >       barrier(); \
> >       if (unlikely(preempt_count_dec_and_test())) { \
> >               __preempt_schedule(); \
> >               if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> >                   !(preempt_count() & SOFTIRQ_MASK)) \
> >                       rcu_all_qs(); \
> >       } \
> > } while (0)
> >
> > Except that we can migrate as soon as that preempt_count_dec_and_test()
> > returns.  And that rcu_all_qs() disables and re-enables preemption,
> > which will result in undesired recursion.  Sigh.
> >
> > So maybe something like this:
> >
> > #define preempt_enable() \
> > do { \
> >       if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> >           !(preempt_count() & SOFTIRQ_MASK)) \
>
> Sigh.  This needs to include (PREEMPT_MASK | SOFTIRQ_MASK),
> but check for equality to something like (1UL << PREEMPT_SHIFT).
>

For PREEMPT_RCU=n and CONFIG_PREEMPT_COUNT=y kernels
for report QS in preempt_enable(),  we can refer to this:

void rcu_read_unlock_strict(void)
{
        struct rcu_data *rdp;

        if (irqs_disabled() || preempt_count() || !rcu_state.gp_kthread)
                return;
        rdp = this_cpu_ptr(&rcu_data);
        rdp->cpu_no_qs.b.norm = false;
        rcu_report_qs_rdp(rdp);
        udelay(rcu_unlock_delay);
}

The rcu critical section may be in the NMI handler  needs to be considered.


Thanks
Zqiang



>
> Clearly time to sleep.  :-/
>
>                                                         Thanx, Paul
>
> >               rcu_all_qs(); \
> >       barrier(); \
> >       if (unlikely(preempt_count_dec_and_test())) { \
> >               __preempt_schedule(); \
> >       } \
> > } while (0)
> >
> > Then rcu_all_qs() becomes something like this:
> >
> > void rcu_all_qs(void)
> > {
> >       unsigned long flags;
> >
> >       /* Load rcu_urgent_qs before other flags. */
> >       if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
> >               return;
> >       this_cpu_write(rcu_data.rcu_urgent_qs, false);
> >       if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
> >               local_irq_save(flags);
> >               rcu_momentary_dyntick_idle();
> >               local_irq_restore(flags);
> >       }
> >       rcu_qs();
> > }
> > EXPORT_SYMBOL_GPL(rcu_all_qs);
> >
> > > Though I do wonder about the likelihood of hitting the case you describe
> > > and maybe instead of adding the check on every preempt_enable()
> > > it might be better to instead force a context switch in the
> > > rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case.)
> >
> > Maybe.  But rcu_all_qs() is way lighter weight than a context switch.
> >
> >                                                       Thanx, Paul
Paul E. McKenney Nov. 21, 2023, 3:32 p.m. UTC | #7
On Tue, Nov 21, 2023 at 02:13:53PM +0800, Z qiang wrote:
> >
> > On Mon, Nov 20, 2023 at 09:17:57PM -0800, Paul E. McKenney wrote:
> > > On Mon, Nov 20, 2023 at 07:26:05PM -0800, Ankur Arora wrote:
> > > >
> > > > Paul E. McKenney <paulmck@kernel.org> writes:
> > > > > On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> > > > >> cond_resched() is used to provide urgent quiescent states for
> > > > >> read-side critical sections on PREEMPT_RCU=n configurations.
> > > > >> This was necessary because lacking preempt_count, there was no
> > > > >> way for the tick handler to know if we were executing in RCU
> > > > >> read-side critical section or not.
> > > > >>
> > > > >> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> > > > >> reliably report quiescent states.
> > > > >>
> > > > >> Accordingly, evaluate preempt_count() based quiescence in
> > > > >> rcu_flavor_sched_clock_irq().
> > > > >>
> > > > >> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > > > >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > > > >> ---
> > > > >>  kernel/rcu/tree_plugin.h |  3 ++-
> > > > >>  kernel/sched/core.c      | 15 +--------------
> > > > >>  2 files changed, 3 insertions(+), 15 deletions(-)
> > > > >>
> > > > >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > >> index f87191e008ff..618f055f8028 100644
> > > > >> --- a/kernel/rcu/tree_plugin.h
> > > > >> +++ b/kernel/rcu/tree_plugin.h
> > > > >> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> > > > >>   */
> > > > >>  static void rcu_flavor_sched_clock_irq(int user)
> > > > >>  {
> > > > >> -        if (user || rcu_is_cpu_rrupt_from_idle()) {
> > > > >> +        if (user || rcu_is_cpu_rrupt_from_idle() ||
> > > > >> +            !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> > > > >
> > > > > This looks good.
> > > > >
> > > > >>                  /*
> > > > >>                   * Get here if this CPU took its interrupt from user
> > > > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > >> index bf5df2b866df..15db5fb7acc7 100644
> > > > >> --- a/kernel/sched/core.c
> > > > >> +++ b/kernel/sched/core.c
> > > > >> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
> > > > >>                  preempt_schedule_common();
> > > > >>                  return 1;
> > > > >>          }
> > > > >> -        /*
> > > > >> -         * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> > > > >> -         * whether the current CPU is in an RCU read-side critical section,
> > > > >> -         * so the tick can report quiescent states even for CPUs looping
> > > > >> -         * in kernel context.  In contrast, in non-preemptible kernels,
> > > > >> -         * RCU readers leave no in-memory hints, which means that CPU-bound
> > > > >> -         * processes executing in kernel context might never report an
> > > > >> -         * RCU quiescent state.  Therefore, the following code causes
> > > > >> -         * cond_resched() to report a quiescent state, but only when RCU
> > > > >> -         * is in urgent need of one.
> > > > >> -         *      /
> > > > >> -#ifndef CONFIG_PREEMPT_RCU
> > > > >> -        rcu_all_qs();
> > > > >> -#endif
> > > > >
> > > > > But...
> > > > >
> > > > > Suppose we have a long-running loop in the kernel that regularly
> > > > > enables preemption, but only momentarily.  Then the added
> > > > > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > > > > for extremely long grace periods.
> > > >
> > > > So, my thinking was that if RCU wants to end a grace period, it would
> > > > force a context switch by setting TIF_NEED_RESCHED (and as patch 38 mentions
> > > > RCU always uses the the eager version) causing __schedule() to call
> > > > rcu_note_context_switch().
> > > > That's similar to the preempt_schedule_common() case in the
> > > > _cond_resched() above.
> > >
> > > But that requires IPIing that CPU, correct?
> > >
> > > > But if I see your point, RCU might just want to register a quiescent
> > > > state and for this long-running loop rcu_flavor_sched_clock_irq() does
> > > > seem to fall down.
> > > >
> > > > > Or did I miss a change that causes preempt_enable() to help RCU out?
> > > >
> > > > Something like this?
> > > >
> > > > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > > > index dc5125b9c36b..e50f358f1548 100644
> > > > --- a/include/linux/preempt.h
> > > > +++ b/include/linux/preempt.h
> > > > @@ -222,6 +222,8 @@ do { \
> > > >         barrier(); \
> > > >         if (unlikely(preempt_count_dec_and_test())) \
> > > >                 __preempt_schedule(); \
> > > > +       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > > > +               rcu_all_qs(); \
> > > >  } while (0)
> > >
> > > Or maybe something like this to lighten the load a bit:
> > >
> > > #define preempt_enable() \
> > > do { \
> > >       barrier(); \
> > >       if (unlikely(preempt_count_dec_and_test())) { \
> > >               __preempt_schedule(); \
> > >               if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > >                   !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > >                       rcu_all_qs(); \
> > >       } \
> > > } while (0)
> > >
> > > And at that point, we should be able to drop the PREEMPT_MASK, not
> > > that it makes any difference that I am aware of:
> > >
> > > #define preempt_enable() \
> > > do { \
> > >       barrier(); \
> > >       if (unlikely(preempt_count_dec_and_test())) { \
> > >               __preempt_schedule(); \
> > >               if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > >                   !(preempt_count() & SOFTIRQ_MASK)) \
> > >                       rcu_all_qs(); \
> > >       } \
> > > } while (0)
> > >
> > > Except that we can migrate as soon as that preempt_count_dec_and_test()
> > > returns.  And that rcu_all_qs() disables and re-enables preemption,
> > > which will result in undesired recursion.  Sigh.
> > >
> > > So maybe something like this:
> > >
> > > #define preempt_enable() \
> > > do { \
> > >       if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > >           !(preempt_count() & SOFTIRQ_MASK)) \
> >
> > Sigh.  This needs to include (PREEMPT_MASK | SOFTIRQ_MASK),
> > but check for equality to something like (1UL << PREEMPT_SHIFT).
> >
> 
> For PREEMPT_RCU=n and CONFIG_PREEMPT_COUNT=y kernels
> for report QS in preempt_enable(),  we can refer to this:
> 
> void rcu_read_unlock_strict(void)
> {
>         struct rcu_data *rdp;
> 
>         if (irqs_disabled() || preempt_count() || !rcu_state.gp_kthread)
>                 return;
>         rdp = this_cpu_ptr(&rcu_data);
>         rdp->cpu_no_qs.b.norm = false;
>         rcu_report_qs_rdp(rdp);
>         udelay(rcu_unlock_delay);
> }
> 
> The rcu critical section may be in the NMI handler  needs to be considered.

You are quite right, though one advantage of leveraging preempt_enable()
is that it cannot really enable preemption in an NMI handler.
But yes, that might need to be accounted for in the comparison with
preempt_count().

The actual condition needs to also allow for the possibility that
this preempt_enable() happened in a kernel built with preemptible RCU.
And probably a few other things that I have not yet thought of.

For one thing, rcu_implicit_dynticks_qs() might need adjustment.
Though I am currently hoping that it will still be able to enlist the
help of other things, for example, preempt_enable() and local_bh_enable().

Yes, it is the easiest thing in the world to just whip out the
resched_cpu() hammer earlier in the grace period, and maybe that is the
eventual solution.  But I would like to try avoiding the extra IPIs if
that can be done reasonably.  ;-)

							Thanx, Paul

> Thanks
> Zqiang
> 
> 
> 
> >
> > Clearly time to sleep.  :-/
> >
> >                                                         Thanx, Paul
> >
> > >               rcu_all_qs(); \
> > >       barrier(); \
> > >       if (unlikely(preempt_count_dec_and_test())) { \
> > >               __preempt_schedule(); \
> > >       } \
> > > } while (0)
> > >
> > > Then rcu_all_qs() becomes something like this:
> > >
> > > void rcu_all_qs(void)
> > > {
> > >       unsigned long flags;
> > >
> > >       /* Load rcu_urgent_qs before other flags. */
> > >       if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
> > >               return;
> > >       this_cpu_write(rcu_data.rcu_urgent_qs, false);
> > >       if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
> > >               local_irq_save(flags);
> > >               rcu_momentary_dyntick_idle();
> > >               local_irq_restore(flags);
> > >       }
> > >       rcu_qs();
> > > }
> > > EXPORT_SYMBOL_GPL(rcu_all_qs);
> > >
> > > > Though I do wonder about the likelihood of hitting the case you describe
> > > > and maybe instead of adding the check on every preempt_enable()
> > > > it might be better to instead force a context switch in the
> > > > rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case.)
> > >
> > > Maybe.  But rcu_all_qs() is way lighter weight than a context switch.
> > >
> > >                                                       Thanx, Paul
Paul E. McKenney Nov. 21, 2023, 7:25 p.m. UTC | #8
On Mon, Nov 20, 2023 at 09:34:05PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 20, 2023 at 09:17:57PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 20, 2023 at 07:26:05PM -0800, Ankur Arora wrote:
> > > 
> > > Paul E. McKenney <paulmck@kernel.org> writes:
> > > > On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> > > >> cond_resched() is used to provide urgent quiescent states for
> > > >> read-side critical sections on PREEMPT_RCU=n configurations.
> > > >> This was necessary because lacking preempt_count, there was no
> > > >> way for the tick handler to know if we were executing in RCU
> > > >> read-side critical section or not.
> > > >>
> > > >> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> > > >> reliably report quiescent states.
> > > >>
> > > >> Accordingly, evaluate preempt_count() based quiescence in
> > > >> rcu_flavor_sched_clock_irq().
> > > >>
> > > >> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > > >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > > >> ---
> > > >>  kernel/rcu/tree_plugin.h |  3 ++-
> > > >>  kernel/sched/core.c      | 15 +--------------
> > > >>  2 files changed, 3 insertions(+), 15 deletions(-)
> > > >>
> > > >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > >> index f87191e008ff..618f055f8028 100644
> > > >> --- a/kernel/rcu/tree_plugin.h
> > > >> +++ b/kernel/rcu/tree_plugin.h
> > > >> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> > > >>   */
> > > >>  static void rcu_flavor_sched_clock_irq(int user)
> > > >>  {
> > > >> -	if (user || rcu_is_cpu_rrupt_from_idle()) {
> > > >> +	if (user || rcu_is_cpu_rrupt_from_idle() ||
> > > >> +	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> > > >
> > > > This looks good.
> > > >
> > > >>  		/*
> > > >>  		 * Get here if this CPU took its interrupt from user
> > > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > >> index bf5df2b866df..15db5fb7acc7 100644
> > > >> --- a/kernel/sched/core.c
> > > >> +++ b/kernel/sched/core.c
> > > >> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
> > > >>  		preempt_schedule_common();
> > > >>  		return 1;
> > > >>  	}
> > > >> -	/*
> > > >> -	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> > > >> -	 * whether the current CPU is in an RCU read-side critical section,
> > > >> -	 * so the tick can report quiescent states even for CPUs looping
> > > >> -	 * in kernel context.  In contrast, in non-preemptible kernels,
> > > >> -	 * RCU readers leave no in-memory hints, which means that CPU-bound
> > > >> -	 * processes executing in kernel context might never report an
> > > >> -	 * RCU quiescent state.  Therefore, the following code causes
> > > >> -	 * cond_resched() to report a quiescent state, but only when RCU
> > > >> -	 * is in urgent need of one.
> > > >> -	 *      /
> > > >> -#ifndef CONFIG_PREEMPT_RCU
> > > >> -	rcu_all_qs();
> > > >> -#endif
> > > >
> > > > But...
> > > >
> > > > Suppose we have a long-running loop in the kernel that regularly
> > > > enables preemption, but only momentarily.  Then the added
> > > > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > > > for extremely long grace periods.
> > > 
> > > So, my thinking was that if RCU wants to end a grace period, it would
> > > force a context switch by setting TIF_NEED_RESCHED (and as patch 38 mentions
> > > RCU always uses the the eager version) causing __schedule() to call
> > > rcu_note_context_switch().
> > > That's similar to the preempt_schedule_common() case in the
> > > _cond_resched() above.
> > 
> > But that requires IPIing that CPU, correct?
> > 
> > > But if I see your point, RCU might just want to register a quiescent
> > > state and for this long-running loop rcu_flavor_sched_clock_irq() does
> > > seem to fall down.
> > > 
> > > > Or did I miss a change that causes preempt_enable() to help RCU out?
> > > 
> > > Something like this?
> > > 
> > > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > > index dc5125b9c36b..e50f358f1548 100644
> > > --- a/include/linux/preempt.h
> > > +++ b/include/linux/preempt.h
> > > @@ -222,6 +222,8 @@ do { \
> > >         barrier(); \
> > >         if (unlikely(preempt_count_dec_and_test())) \
> > >                 __preempt_schedule(); \
> > > +       if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > > +               rcu_all_qs(); \
> > >  } while (0)
> > 
> > Or maybe something like this to lighten the load a bit:
> > 
> > #define preempt_enable() \
> > do { \
> > 	barrier(); \
> > 	if (unlikely(preempt_count_dec_and_test())) { \
> > 		__preempt_schedule(); \
> > 		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > 		    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > 			rcu_all_qs(); \
> > 	} \
> > } while (0)
> > 
> > And at that point, we should be able to drop the PREEMPT_MASK, not
> > that it makes any difference that I am aware of:
> > 
> > #define preempt_enable() \
> > do { \
> > 	barrier(); \
> > 	if (unlikely(preempt_count_dec_and_test())) { \
> > 		__preempt_schedule(); \
> > 		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > 		    !(preempt_count() & SOFTIRQ_MASK)) \
> > 			rcu_all_qs(); \
> > 	} \
> > } while (0)
> > 
> > Except that we can migrate as soon as that preempt_count_dec_and_test()
> > returns.  And that rcu_all_qs() disables and re-enables preemption,
> > which will result in undesired recursion.  Sigh.
> > 
> > So maybe something like this:
> > 
> > #define preempt_enable() \
> > do { \
> > 	if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > 	    !(preempt_count() & SOFTIRQ_MASK)) \
> 
> Sigh.  This needs to include (PREEMPT_MASK | SOFTIRQ_MASK),
> but check for equality to something like (1UL << PREEMPT_SHIFT).
> 
> Clearly time to sleep.  :-/

Maybe this might actually work:

#define preempt_enable() \
do { \
	barrier(); \
	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
	    !irqs_disabled()) \
		rcu_all_qs(); \
	if (unlikely(preempt_count_dec_and_test())) { \
		__preempt_schedule(); \
	} \
} while (0)

And the rcu_all_qs() below might also work.

							Thanx, Paul

> > 		rcu_all_qs(); \
> > 	barrier(); \
> > 	if (unlikely(preempt_count_dec_and_test())) { \
> > 		__preempt_schedule(); \
> > 	} \
> > } while (0)
> > 
> > Then rcu_all_qs() becomes something like this:
> > 
> > void rcu_all_qs(void)
> > {
> > 	unsigned long flags;
> > 
> > 	/* Load rcu_urgent_qs before other flags. */
> > 	if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
> > 		return;
> > 	this_cpu_write(rcu_data.rcu_urgent_qs, false);
> > 	if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
> > 		local_irq_save(flags);
> > 		rcu_momentary_dyntick_idle();
> > 		local_irq_restore(flags);
> > 	}
> > 	rcu_qs();
> > }
> > EXPORT_SYMBOL_GPL(rcu_all_qs);
> > 
> > > Though I do wonder about the likelihood of hitting the case you describe
> > > and maybe instead of adding the check on every preempt_enable()
> > > it might be better to instead force a context switch in the
> > > rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case.)
> > 
> > Maybe.  But rcu_all_qs() is way lighter weight than a context switch.
> > 
> > 							Thanx, Paul
Peter Zijlstra Nov. 21, 2023, 8:30 p.m. UTC | #9
On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:
> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> 	    !irqs_disabled()) \
> 		rcu_all_qs(); \
> 	if (unlikely(preempt_count_dec_and_test())) { \
> 		__preempt_schedule(); \
> 	} \
> } while (0)

Aaaaahhh, please no. We spend so much time reducing preempt_enable() to
the minimal thing it is today, this will make it blow up into something
giant again.
Paul E. McKenney Nov. 21, 2023, 9:14 p.m. UTC | #10
On Tue, Nov 21, 2023 at 09:30:49PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:
> > #define preempt_enable() \
> > do { \
> > 	barrier(); \
> > 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> > 	    !irqs_disabled()) \
> > 		rcu_all_qs(); \
> > 	if (unlikely(preempt_count_dec_and_test())) { \
> > 		__preempt_schedule(); \
> > 	} \
> > } while (0)
> 
> Aaaaahhh, please no. We spend so much time reducing preempt_enable() to
> the minimal thing it is today, this will make it blow up into something
> giant again.

IPIs, then.  Or retaining some cond_resched() calls, as needed.

Or is there another way?

							Thanx, Paul
Steven Rostedt Nov. 21, 2023, 9:38 p.m. UTC | #11
On Tue, 21 Nov 2023 13:14:16 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> On Tue, Nov 21, 2023 at 09:30:49PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:  
> > > #define preempt_enable() \
> > > do { \
> > > 	barrier(); \
> > > 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > > 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> > > 	    !irqs_disabled()) \

Could we make the above an else case of the below if ?

> > > 		rcu_all_qs(); \
> > > 	if (unlikely(preempt_count_dec_and_test())) { \
> > > 		__preempt_schedule(); \
> > > 	} \
> > > } while (0)  
> > 
> > Aaaaahhh, please no. We spend so much time reducing preempt_enable() to
> > the minimal thing it is today, this will make it blow up into something
> > giant again.  

Note, the above is only true with "CONFIG_PREEMPT_RCU is not set", which
keeps the preempt_count() for preemptable kernels with PREEMPT_RCU still minimal.

-- Steve
Paul E. McKenney Nov. 21, 2023, 10:26 p.m. UTC | #12
On Tue, Nov 21, 2023 at 04:38:34PM -0500, Steven Rostedt wrote:
> On Tue, 21 Nov 2023 13:14:16 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > On Tue, Nov 21, 2023 at 09:30:49PM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:  
> > > > #define preempt_enable() \
> > > > do { \
> > > > 	barrier(); \
> > > > 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > > > 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> > > > 	    !irqs_disabled()) \
> 
> Could we make the above an else case of the below if ?

Wouldn't that cause the above preempt_count() test to always fail?

Another approach is to bury the test in preempt_count_dec_and_test(),
but I suspect that this would not make Peter any more happy than my
earlier suggestion.  ;-)

> > > > 		rcu_all_qs(); \
> > > > 	if (unlikely(preempt_count_dec_and_test())) { \
> > > > 		__preempt_schedule(); \
> > > > 	} \
> > > > } while (0)  
> > > 
> > > Aaaaahhh, please no. We spend so much time reducing preempt_enable() to
> > > the minimal thing it is today, this will make it blow up into something
> > > giant again.  
> 
> Note, the above is only true with "CONFIG_PREEMPT_RCU is not set", which
> keeps the preempt_count() for preemptable kernels with PREEMPT_RCU still minimal.

Agreed, and there is probably some workload that does not like this.
After all, current CONFIG_PREEMPT_DYNAMIC=y booted with preempt=none
would have those cond_resched() invocations.  I was leary of checking
dynamic information, but maybe sched_feat() is faster than I am thinking?
(It should be with the static_branch, but not sure about the other two
access modes.)

							Thanx, Paul
Steven Rostedt Nov. 21, 2023, 10:52 p.m. UTC | #13
On Tue, 21 Nov 2023 14:26:33 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> On Tue, Nov 21, 2023 at 04:38:34PM -0500, Steven Rostedt wrote:
> > On Tue, 21 Nov 2023 13:14:16 -0800
> > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >   
> > > On Tue, Nov 21, 2023 at 09:30:49PM +0100, Peter Zijlstra wrote:  
> > > > On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:    
> > > > > #define preempt_enable() \
> > > > > do { \
> > > > > 	barrier(); \
> > > > > 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > > > > 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> > > > > 	    !irqs_disabled()) \  
> > 
> > Could we make the above an else case of the below if ?  
> 
> Wouldn't that cause the above preempt_count() test to always fail?

preempt_count_dec_and_test() returns true if preempt_count() is zero, which
happens only if NEED_RESCHED is set, and the rest of preempt_count() is not
set. (NEED_RESCHED bit in preempt_count() is really the inverse of
NEED_RESCHED). Do we need to call rcu_all_qs() when we call the scheduler?
Isn't scheduling a quiescent state for most RCU flavors?

I thought this was to help move along the quiescent states without added
cond_resched() around, which has:

int __sched __cond_resched(void)
{
	if (should_resched(0)) {
		preempt_schedule_common();
		return 1;
	}
	/*
	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
	 * whether the current CPU is in an RCU read-side critical section,
	 * so the tick can report quiescent states even for CPUs looping
	 * in kernel context.  In contrast, in non-preemptible kernels,
	 * RCU readers leave no in-memory hints, which means that CPU-bound
	 * processes executing in kernel context might never report an
	 * RCU quiescent state.  Therefore, the following code causes
	 * cond_resched() to report a quiescent state, but only when RCU
	 * is in urgent need of one.
	 */
#ifndef CONFIG_PREEMPT_RCU
	rcu_all_qs();
#endif
	return 0;
}

Where if we schedule, we don't call rcu_all_qs().

I stand by that being in the else statement. It looks like that would keep
the previous work flow.

-- Steve
Paul E. McKenney Nov. 22, 2023, 12:01 a.m. UTC | #14
On Tue, Nov 21, 2023 at 05:52:09PM -0500, Steven Rostedt wrote:
> On Tue, 21 Nov 2023 14:26:33 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > On Tue, Nov 21, 2023 at 04:38:34PM -0500, Steven Rostedt wrote:
> > > On Tue, 21 Nov 2023 13:14:16 -0800
> > > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > >   
> > > > On Tue, Nov 21, 2023 at 09:30:49PM +0100, Peter Zijlstra wrote:  
> > > > > On Tue, Nov 21, 2023 at 11:25:18AM -0800, Paul E. McKenney wrote:    
> > > > > > #define preempt_enable() \
> > > > > > do { \
> > > > > > 	barrier(); \
> > > > > > 	if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > > > > > 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) &&
> > > > > > 	    !irqs_disabled()) \  
> > > 
> > > Could we make the above an else case of the below if ?  
> > 
> > Wouldn't that cause the above preempt_count() test to always fail?
> 
> preempt_count_dec_and_test() returns true if preempt_count() is zero, which
> happens only if NEED_RESCHED is set, and the rest of preempt_count() is not
> set. (NEED_RESCHED bit in preempt_count() is really the inverse of
> NEED_RESCHED). Do we need to call rcu_all_qs() when we call the scheduler?
> Isn't scheduling a quiescent state for most RCU flavors?
> 
> I thought this was to help move along the quiescent states without added
> cond_resched() around, which has:
> 
> int __sched __cond_resched(void)
> {
> 	if (should_resched(0)) {
> 		preempt_schedule_common();
> 		return 1;
> 	}
> 	/*
> 	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> 	 * whether the current CPU is in an RCU read-side critical section,
> 	 * so the tick can report quiescent states even for CPUs looping
> 	 * in kernel context.  In contrast, in non-preemptible kernels,
> 	 * RCU readers leave no in-memory hints, which means that CPU-bound
> 	 * processes executing in kernel context might never report an
> 	 * RCU quiescent state.  Therefore, the following code causes
> 	 * cond_resched() to report a quiescent state, but only when RCU
> 	 * is in urgent need of one.
> 	 */
> #ifndef CONFIG_PREEMPT_RCU
> 	rcu_all_qs();
> #endif
> 	return 0;
> }
> 
> Where if we schedule, we don't call rcu_all_qs().

True enough, but in this __cond_resched() case we know that we are in
an RCU quiescent state regardless of what should_resched() says.

In contrast, with preempt_enable(), we are only in a quiescent state
if __preempt_count_dec_and_test() returns true, and even then only if
interrupts are enabled.

> I stand by that being in the else statement. It looks like that would keep
> the previous work flow.

Ah, because PREEMPT_NEED_RESCHED is zero when we need to reschedule,
so that when __preempt_count_dec_and_test() returns false, we might
still be in an RCU quiescent state in the case where there was no need
to reschedule.  Good point!

In which case...

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
	else if (!sched_feat(FORCE_PREEMPT) && \
		 (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) && \
		 !irqs_disabled()) \
) \
			rcu_all_qs(); \
} while (0)

Keeping rcu_all_qs() pretty much as is.  Or some or all of the "else if"
condition could be pushed down into rcu_all_qs(), depending on whether
Peter's objection was call-site object code size, execution path length,
or both.  ;-)

If the objection is both call-site object code size and execution path
length, then maybe all but the preempt_count() check should be pushed
into rcu_all_qs().

Was that what you had in mind, or am I missing your point?

							Thanx, Paul
Steven Rostedt Nov. 22, 2023, 12:12 a.m. UTC | #15
On Tue, 21 Nov 2023 16:01:24 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> 
> > I stand by that being in the else statement. It looks like that would keep
> > the previous work flow.  
> 
> Ah, because PREEMPT_NEED_RESCHED is zero when we need to reschedule,
> so that when __preempt_count_dec_and_test() returns false, we might
> still be in an RCU quiescent state in the case where there was no need
> to reschedule.  Good point!
> 
> In which case...
> 
> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) \
> 		__preempt_schedule(); \
> 	else if (!sched_feat(FORCE_PREEMPT) && \
> 		 (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) && \
> 		 !irqs_disabled()) \
> ) \
> 			rcu_all_qs(); \
> } while (0)
> 
> Keeping rcu_all_qs() pretty much as is.  Or some or all of the "else if"
> condition could be pushed down into rcu_all_qs(), depending on whether
> Peter's objection was call-site object code size, execution path length,
> or both.  ;-)
> 
> If the objection is both call-site object code size and execution path
> length, then maybe all but the preempt_count() check should be pushed
> into rcu_all_qs().
> 
> Was that what you had in mind, or am I missing your point?

Yes, that is what I had in mind.

Should we also keep the !IS_ENABLED(CONFIG_PREEMPT_RCU) check, which makes
the entire thing optimized out when PREEMPT_RCU is enabled?

-- Steve
Paul E. McKenney Nov. 22, 2023, 1:09 a.m. UTC | #16
On Tue, Nov 21, 2023 at 07:12:32PM -0500, Steven Rostedt wrote:
> On Tue, 21 Nov 2023 16:01:24 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > 
> > > I stand by that being in the else statement. It looks like that would keep
> > > the previous work flow.  
> > 
> > Ah, because PREEMPT_NEED_RESCHED is zero when we need to reschedule,
> > so that when __preempt_count_dec_and_test() returns false, we might
> > still be in an RCU quiescent state in the case where there was no need
> > to reschedule.  Good point!
> > 
> > In which case...
> > 
> > #define preempt_enable() \
> > do { \
> > 	barrier(); \
> > 	if (unlikely(preempt_count_dec_and_test())) \
> > 		__preempt_schedule(); \
> > 	else if (!sched_feat(FORCE_PREEMPT) && \
> > 		 (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) && \
> > 		 !irqs_disabled()) \
> > ) \
> > 			rcu_all_qs(); \
> > } while (0)
> > 
> > Keeping rcu_all_qs() pretty much as is.  Or some or all of the "else if"
> > condition could be pushed down into rcu_all_qs(), depending on whether
> > Peter's objection was call-site object code size, execution path length,
> > or both.  ;-)
> > 
> > If the objection is both call-site object code size and execution path
> > length, then maybe all but the preempt_count() check should be pushed
> > into rcu_all_qs().
> > 
> > Was that what you had in mind, or am I missing your point?
> 
> Yes, that is what I had in mind.
> 
> Should we also keep the !IS_ENABLED(CONFIG_PREEMPT_RCU) check, which makes
> the entire thing optimized out when PREEMPT_RCU is enabled?

I substituted the !sched_feat(FORCE_PREEMPT) for this because as I
understand it, sites currently using CONFIG_PREEMPT_DYNAMIC=y (which is
the default) and booting with preempt=none will currently have their
grace periods helped by cond_resched(), so likely also need help,
perhaps also from preempt_enable().

							Thanx, Paul
Thomas Gleixner Nov. 28, 2023, 5:04 p.m. UTC | #17
Paul!

On Mon, Nov 20 2023 at 16:38, Paul E. McKenney wrote:
> But...
>
> Suppose we have a long-running loop in the kernel that regularly
> enables preemption, but only momentarily.  Then the added
> rcu_flavor_sched_clock_irq() check would almost always fail, making
> for extremely long grace periods.  Or did I miss a change that causes
> preempt_enable() to help RCU out?

So first of all this is not any different from today and even with
RCU_PREEMPT=y a tight loop:

    do {
    	preempt_disable();
        do_stuff();
        preempt_enable();
    }

will not allow rcu_flavor_sched_clock_irq() to detect QS reliably. All
it can do is to force reschedule/preemption after some time, which in
turn ends up in a QS.

The current NONE/VOLUNTARY models, which imply RCU_PRREMPT=n cannot do
that at all because the preempt_enable() is a NOOP and there is no
preemption point at return from interrupt to kernel.

    do {
        do_stuff();
    }

So the only thing which makes that "work" is slapping a cond_resched()
into the loop:

    do {
        do_stuff();
        cond_resched();
    }

But the whole concept behind LAZY is that the loop will always be:

    do {
    	preempt_disable();
        do_stuff();
        preempt_enable();
    }

and the preempt_enable() will always be a functional preemption point.

So let's look at the simple case where more than one task is runnable on
a given CPU:

    loop()

      preempt_disable();

      --> tick interrupt
          set LAZY_NEED_RESCHED

      preempt_enable() -> Does nothing because NEED_RESCHED is not set

      preempt_disable();

      --> tick interrupt
          set NEED_RESCHED

      preempt_enable()
        preempt_schedule()
          schedule()
            report_QS()

which means that on the second tick a quiesent state is
reported. Whether that's really going to be a full tick which is granted
that's a scheduler decision and implementation detail and not really
relevant for discussing the concept.

Now the problematic case is when there is only one task runnable on a
given CPU because then the tick interrupt will set neither of the
preemption bits. Which is fine from a scheduler perspective, but not so
much from a RCU perspective.

But the whole point of LAZY is to be able to enforce rescheduling at the
next possible preemption point. So RCU can utilize that too:

rcu_flavor_sched_clock_irq(bool user)
{
	if (user || rcu_is_cpu_rrupt_from_idle() ||
	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
		rcu_qs();
                return;
	}

        if (this_cpu_read(rcu_data.rcu_urgent_qs))
        	set_need_resched();
}

So:

    loop()

      preempt_disable();

      --> tick interrupt
            rcu_flavor_sched_clock_irq()
                sets NEED_RESCHED

      preempt_enable()
        preempt_schedule()
          schedule()
            report_QS()

See? No magic nonsense in preempt_enable(), no cond_resched(), nothing.

The above rcu_flavor_sched_clock_irq() check for rcu_data.rcu_urgent_qs
is not really fundamentaly different from the check in rcu_all_gs(). The
main difference is that it is bound to the tick, so the detection/action
might be delayed by a tick. If that turns out to be a problem, then this
stuff has far more serious issues underneath.

So now you might argue that for a loop like this:

    do {
        mutex_lock();
        do_stuff();
        mutex_unlock();
    }

the ideal preemption point is post mutex_unlock(), which is where
someone would mindfully (*cough*) place a cond_resched(), right?

So if that turns out to matter in reality and not just by academic
inspection, then we are far better off to annotate such code with:

    do {
        preempt_lazy_disable();
        mutex_lock();
        do_stuff();
        mutex_unlock();
        preempt_lazy_enable();
    }

and let preempt_lazy_enable() evaluate the NEED_RESCHED_LAZY bit.

Then rcu_flavor_sched_clock_irq(bool user) can then use a two stage
approach like the scheduler:

rcu_flavor_sched_clock_irq(bool user)
{
	if (user || rcu_is_cpu_rrupt_from_idle() ||
	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
		rcu_qs();
                return;
	}

        if (this_cpu_read(rcu_data.rcu_urgent_qs)) {
        	if (!need_resched_lazy()))
                	set_need_resched_lazy();
                else
                	set_need_resched();
	}
}

But for a start I would just use the trivial

        if (this_cpu_read(rcu_data.rcu_urgent_qs))
        	set_need_resched();

approach and see where this gets us.

With the approach I suggested to Ankur, i.e. having PREEMPT_AUTO(or
LAZY) as a config option we can work on the details of the AUTO and
RCU_PREEMPT=n flavour up to the point where we are happy to get rid
of the whole zoo of config options alltogether.

Just insisting that RCU_PREEMPT=n requires cond_resched() and whatsoever
is not really getting us anywhere.

Thanks,

        tglx
Paul E. McKenney Dec. 5, 2023, 1:33 a.m. UTC | #18
On Tue, Nov 28, 2023 at 06:04:33PM +0100, Thomas Gleixner wrote:
> Paul!
> 
> On Mon, Nov 20 2023 at 16:38, Paul E. McKenney wrote:
> > But...
> >
> > Suppose we have a long-running loop in the kernel that regularly
> > enables preemption, but only momentarily.  Then the added
> > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > for extremely long grace periods.  Or did I miss a change that causes
> > preempt_enable() to help RCU out?
> 
> So first of all this is not any different from today and even with
> RCU_PREEMPT=y a tight loop:
> 
>     do {
>     	preempt_disable();
>         do_stuff();
>         preempt_enable();
>     }
> 
> will not allow rcu_flavor_sched_clock_irq() to detect QS reliably. All
> it can do is to force reschedule/preemption after some time, which in
> turn ends up in a QS.

True, but we don't run RCU_PREEMPT=y on the fleet.  So although this
argument should offer comfort to those who would like to switch from
forced preemption to lazy preemption, it doesn't help for those of us
running NONE/VOLUNTARY.

I can of course compensate if need be by making RCU more aggressive with
the resched_cpu() hammer, which includes an IPI.  For non-nohz_full CPUs,
it currently waits halfway to the stall-warning timeout.

> The current NONE/VOLUNTARY models, which imply RCU_PRREMPT=n cannot do
> that at all because the preempt_enable() is a NOOP and there is no
> preemption point at return from interrupt to kernel.
> 
>     do {
>         do_stuff();
>     }
> 
> So the only thing which makes that "work" is slapping a cond_resched()
> into the loop:
> 
>     do {
>         do_stuff();
>         cond_resched();
>     }

Yes, exactly.

> But the whole concept behind LAZY is that the loop will always be:
> 
>     do {
>     	preempt_disable();
>         do_stuff();
>         preempt_enable();
>     }
> 
> and the preempt_enable() will always be a functional preemption point.

Understood.  And if preempt_enable() can interact with RCU when requested,
I would expect that this could make quite a few calls to cond_resched()
provably unnecessary.  There was some discussion of this:

https://lore.kernel.org/all/0d6a8e80-c89b-4ded-8de1-8c946874f787@paulmck-laptop/

There were objections to an earlier version.  Is this version OK?

> So let's look at the simple case where more than one task is runnable on
> a given CPU:
> 
>     loop()
> 
>       preempt_disable();
> 
>       --> tick interrupt
>           set LAZY_NEED_RESCHED
> 
>       preempt_enable() -> Does nothing because NEED_RESCHED is not set
> 
>       preempt_disable();
> 
>       --> tick interrupt
>           set NEED_RESCHED
> 
>       preempt_enable()
>         preempt_schedule()
>           schedule()
>             report_QS()
> 
> which means that on the second tick a quiesent state is
> reported. Whether that's really going to be a full tick which is granted
> that's a scheduler decision and implementation detail and not really
> relevant for discussing the concept.

In my experience, the implementation details make themselves relevant
sooner or later, and often sooner...

> Now the problematic case is when there is only one task runnable on a
> given CPU because then the tick interrupt will set neither of the
> preemption bits. Which is fine from a scheduler perspective, but not so
> much from a RCU perspective.
> 
> But the whole point of LAZY is to be able to enforce rescheduling at the
> next possible preemption point. So RCU can utilize that too:
> 
> rcu_flavor_sched_clock_irq(bool user)
> {
> 	if (user || rcu_is_cpu_rrupt_from_idle() ||
> 	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> 		rcu_qs();
>                 return;
> 	}
> 
>         if (this_cpu_read(rcu_data.rcu_urgent_qs))
>         	set_need_resched();
> }

Yes, that is one of the changes that would be good to make, as discussed
previously.

> So:
> 
>     loop()
> 
>       preempt_disable();
> 
>       --> tick interrupt
>             rcu_flavor_sched_clock_irq()
>                 sets NEED_RESCHED
> 
>       preempt_enable()
>         preempt_schedule()
>           schedule()
>             report_QS()
> 
> See? No magic nonsense in preempt_enable(), no cond_resched(), nothing.

Understood, but that does delay detection of that quiescent state by up
to one tick.

> The above rcu_flavor_sched_clock_irq() check for rcu_data.rcu_urgent_qs
> is not really fundamentaly different from the check in rcu_all_gs(). The
> main difference is that it is bound to the tick, so the detection/action
> might be delayed by a tick. If that turns out to be a problem, then this
> stuff has far more serious issues underneath.

Again, as I have stated before, one advantage of PREEMPT_COUNT=y is this
ability, so yes, believe it or not, I really do understand this.  ;-)

> So now you might argue that for a loop like this:
> 
>     do {
>         mutex_lock();
>         do_stuff();
>         mutex_unlock();
>     }
> 
> the ideal preemption point is post mutex_unlock(), which is where
> someone would mindfully (*cough*) place a cond_resched(), right?
> 
> So if that turns out to matter in reality and not just by academic
> inspection, then we are far better off to annotate such code with:
> 
>     do {
>         preempt_lazy_disable();
>         mutex_lock();
>         do_stuff();
>         mutex_unlock();
>         preempt_lazy_enable();
>     }
> 
> and let preempt_lazy_enable() evaluate the NEED_RESCHED_LAZY bit.

I am not exactly sure what semantics you are proposing with this pairing
as opposed to "this would be a good time to preempt in response to the
pending lazy request".  But I do agree that something like this could
replace at least a few more instance of cond_resched(), so that is good.
Not necessarily all of them, though.

> Then rcu_flavor_sched_clock_irq(bool user) can then use a two stage
> approach like the scheduler:
> 
> rcu_flavor_sched_clock_irq(bool user)
> {
> 	if (user || rcu_is_cpu_rrupt_from_idle() ||
> 	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> 		rcu_qs();
>                 return;
> 	}
> 
>         if (this_cpu_read(rcu_data.rcu_urgent_qs)) {
>         	if (!need_resched_lazy()))
>                 	set_need_resched_lazy();
>                 else
>                 	set_need_resched();
> 	}
> }
> 
> But for a start I would just use the trivial
> 
>         if (this_cpu_read(rcu_data.rcu_urgent_qs))
>         	set_need_resched();
> 
> approach and see where this gets us.

Agreed, I would start with the plain set_need_resched().  This shouldn't
happen all that often, on default x86 builds nine milliseconds into the
grace period.

> With the approach I suggested to Ankur, i.e. having PREEMPT_AUTO(or
> LAZY) as a config option we can work on the details of the AUTO and
> RCU_PREEMPT=n flavour up to the point where we are happy to get rid
> of the whole zoo of config options alltogether.

I agree that some simplification should be possible and would be
desireable.

> Just insisting that RCU_PREEMPT=n requires cond_resched() and whatsoever
> is not really getting us anywhere.

Except that this is not what is happening, Thomas.  ;-)

You are asserting that all of the cond_resched() calls can safely be
eliminated.  That might well be, but more than assertion is required.
You have come up with some good ways of getting rid of some classes of
them, which is a very good and very welcome thing.  But that is not the
same as having proved that all of them may be safely removed.

							Thanx, Paul
Thomas Gleixner Dec. 6, 2023, 3:10 p.m. UTC | #19
Paul!

On Mon, Dec 04 2023 at 17:33, Paul E. McKenney wrote:
> On Tue, Nov 28, 2023 at 06:04:33PM +0100, Thomas Gleixner wrote:
>> So:
>> 
>>     loop()
>> 
>>       preempt_disable();
>> 
>>       --> tick interrupt
>>             rcu_flavor_sched_clock_irq()
>>                 sets NEED_RESCHED
>> 
>>       preempt_enable()
>>         preempt_schedule()
>>           schedule()
>>             report_QS()
>> 
>> See? No magic nonsense in preempt_enable(), no cond_resched(), nothing.
>
> Understood, but that does delay detection of that quiescent state by up
> to one tick.

Sure, but does that really matter in practice?

>> So if that turns out to matter in reality and not just by academic
>> inspection, then we are far better off to annotate such code with:
>> 
>>     do {
>>         preempt_lazy_disable();
>>         mutex_lock();
>>         do_stuff();
>>         mutex_unlock();
>>         preempt_lazy_enable();
>>     }
>> 
>> and let preempt_lazy_enable() evaluate the NEED_RESCHED_LAZY bit.
>
> I am not exactly sure what semantics you are proposing with this pairing
> as opposed to "this would be a good time to preempt in response to the
> pending lazy request".  But I do agree that something like this could
> replace at least a few more instance of cond_resched(), so that is good.
> Not necessarily all of them, though.

The main semantic difference is that such a mechanism is properly
nesting and can be eventually subsumed into the actual locking
constructs.

>> Just insisting that RCU_PREEMPT=n requires cond_resched() and whatsoever
>> is not really getting us anywhere.
>
> Except that this is not what is happening, Thomas.  ;-)
>
> You are asserting that all of the cond_resched() calls can safely be
> eliminated.  That might well be, but more than assertion is required.
> You have come up with some good ways of getting rid of some classes of
> them, which is a very good and very welcome thing.  But that is not the
> same as having proved that all of them may be safely removed.

Neither have you proven that any of them will be required with the new
PREEMPT_LAZY model. :)

Your experience and knowledge in this area is certainly appreciated, but
under the changed semantics of LAZY it's debatable whether observations
and assumptions which are based on PREEMPT_NONE behaviour still apply.

We'll see.

Thanks,

        tglx
Ankur Arora Dec. 7, 2023, 1:31 a.m. UTC | #20
Paul E. McKenney <paulmck@kernel.org> writes:

> On Tue, Nov 28, 2023 at 06:04:33PM +0100, Thomas Gleixner wrote:
>> Paul!
>>
>> On Mon, Nov 20 2023 at 16:38, Paul E. McKenney wrote:
>> > But...
>> >
>> > Suppose we have a long-running loop in the kernel that regularly
>> > enables preemption, but only momentarily.  Then the added
>> > rcu_flavor_sched_clock_irq() check would almost always fail, making
>> > for extremely long grace periods.  Or did I miss a change that causes
>> > preempt_enable() to help RCU out?
>>
>> So first of all this is not any different from today and even with
>> RCU_PREEMPT=y a tight loop:
>>
>>     do {
>>     	preempt_disable();
>>         do_stuff();
>>         preempt_enable();
>>     }
>>
>> will not allow rcu_flavor_sched_clock_irq() to detect QS reliably. All
>> it can do is to force reschedule/preemption after some time, which in
>> turn ends up in a QS.
>
> True, but we don't run RCU_PREEMPT=y on the fleet.  So although this
> argument should offer comfort to those who would like to switch from
> forced preemption to lazy preemption, it doesn't help for those of us
> running NONE/VOLUNTARY.
>
> I can of course compensate if need be by making RCU more aggressive with
> the resched_cpu() hammer, which includes an IPI.  For non-nohz_full CPUs,
> it currently waits halfway to the stall-warning timeout.
>
>> The current NONE/VOLUNTARY models, which imply RCU_PRREMPT=n cannot do
>> that at all because the preempt_enable() is a NOOP and there is no
>> preemption point at return from interrupt to kernel.
>>
>>     do {
>>         do_stuff();
>>     }
>>
>> So the only thing which makes that "work" is slapping a cond_resched()
>> into the loop:
>>
>>     do {
>>         do_stuff();
>>         cond_resched();
>>     }
>
> Yes, exactly.
>
>> But the whole concept behind LAZY is that the loop will always be:
>>
>>     do {
>>     	preempt_disable();
>>         do_stuff();
>>         preempt_enable();
>>     }
>>
>> and the preempt_enable() will always be a functional preemption point.
>
> Understood.  And if preempt_enable() can interact with RCU when requested,
> I would expect that this could make quite a few calls to cond_resched()
> provably unnecessary.  There was some discussion of this:
>
> https://lore.kernel.org/all/0d6a8e80-c89b-4ded-8de1-8c946874f787@paulmck-laptop/
>
> There were objections to an earlier version.  Is this version OK?

Copying that version here for discussion purposes:

        #define preempt_enable() \
        do { \
                barrier(); \
                if (unlikely(preempt_count_dec_and_test())) \
                        __preempt_schedule(); \
                else if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && \
                        (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK) == PREEMPT_OFFSET) && \
                        !irqs_disabled()) \
        ) \
                                rcu_all_qs(); \
        } while (0)

(sched_feat is not exposed outside the scheduler so I'm using the
!CONFIG_PREEMPT_RCU version here.)


I have two-fold objections to this: as PeterZ pointed out, this is
quite a bit heavier than the fairly minimal preempt_enable() -- both
conceptually where the preemption logic now needs to know about when
to check for a specific RCU quiescience state, and in terms of code
size (seems to add about a cacheline worth) to every preempt_enable()
site.

If we end up needing this, is it valid to just optimistically check if
a quiescent state needs to be registered (see below)?
Though this version exposes rcu_data.rcu_urgent_qs outside RCU but maybe
we can encapsulate that in linux/rcupdate.h.

For V1 will go with this simple check in rcu_flavor_sched_clock_irq()
and see where that gets us:

>         if (this_cpu_read(rcu_data.rcu_urgent_qs))
>         	set_need_resched();

---
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 9aa6358a1a16..d8139cda8814 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -226,9 +226,11 @@ do { \
 #ifdef CONFIG_PREEMPTION
 #define preempt_enable() \
 do { \
 	barrier(); \
 	if (unlikely(preempt_count_dec_and_test())) \
 		__preempt_schedule(); \
+	else if (unlikely(raw_cpu_read(rcu_data.rcu_urgent_qs))) \
+		rcu_all_qs_check();
 } while (0)

 #define preempt_enable_notrace() \
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 41021080ad25..2ba2743d7ba3 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -887,6 +887,17 @@ void rcu_all_qs(void)
 }
 EXPORT_SYMBOL_GPL(rcu_all_qs);

+void rcu_all_qs_check(void)
+{
+	if (((preempt_count() &
+	      (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) == PREEMPT_OFFSET) && \
+	      !irqs_disabled())
+
+		  rcu_all_qs();
+}
+EXPORT_SYMBOL_GP(rcu_all_qs);
+
+
 /*
  * Note a PREEMPTION=n context switch. The caller must have disabled interrupts.
  */


--
ankur
Steven Rostedt Dec. 7, 2023, 2:10 a.m. UTC | #21
On Wed, 06 Dec 2023 17:31:30 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> ---
> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> index 9aa6358a1a16..d8139cda8814 100644
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -226,9 +226,11 @@ do { \
>  #ifdef CONFIG_PREEMPTION
>  #define preempt_enable() \
>  do { \
>  	barrier(); \
>  	if (unlikely(preempt_count_dec_and_test())) \
>  		__preempt_schedule(); \
> +	else if (unlikely(raw_cpu_read(rcu_data.rcu_urgent_qs))) \

Shouldn't this still have the:

	else if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && \

That is, is it needed when PREEMPT_RCU is set?

-- Steve


> +		rcu_all_qs_check();
>  } while (0)
> 
>  #define preempt_enable_notrace() \
Paul E. McKenney Dec. 7, 2023, 4:17 a.m. UTC | #22
On Wed, Dec 06, 2023 at 04:10:18PM +0100, Thomas Gleixner wrote:
> Paul!
> 
> On Mon, Dec 04 2023 at 17:33, Paul E. McKenney wrote:
> > On Tue, Nov 28, 2023 at 06:04:33PM +0100, Thomas Gleixner wrote:
> >> So:
> >> 
> >>     loop()
> >> 
> >>       preempt_disable();
> >> 
> >>       --> tick interrupt
> >>             rcu_flavor_sched_clock_irq()
> >>                 sets NEED_RESCHED
> >> 
> >>       preempt_enable()
> >>         preempt_schedule()
> >>           schedule()
> >>             report_QS()
> >> 
> >> See? No magic nonsense in preempt_enable(), no cond_resched(), nothing.
> >
> > Understood, but that does delay detection of that quiescent state by up
> > to one tick.
> 
> Sure, but does that really matter in practice?

It might, but yes, I would expect it to matter far less than the other
things I have been calling out.

> >> So if that turns out to matter in reality and not just by academic
> >> inspection, then we are far better off to annotate such code with:
> >> 
> >>     do {
> >>         preempt_lazy_disable();
> >>         mutex_lock();
> >>         do_stuff();
> >>         mutex_unlock();
> >>         preempt_lazy_enable();
> >>     }
> >> 
> >> and let preempt_lazy_enable() evaluate the NEED_RESCHED_LAZY bit.
> >
> > I am not exactly sure what semantics you are proposing with this pairing
> > as opposed to "this would be a good time to preempt in response to the
> > pending lazy request".  But I do agree that something like this could
> > replace at least a few more instance of cond_resched(), so that is good.
> > Not necessarily all of them, though.
> 
> The main semantic difference is that such a mechanism is properly
> nesting and can be eventually subsumed into the actual locking
> constructs.

OK, fair enough.

And noting that testing should include workloads that exercise things
like mutex_lock() and mutex_trylock() fastpaths.

> >> Just insisting that RCU_PREEMPT=n requires cond_resched() and whatsoever
> >> is not really getting us anywhere.
> >
> > Except that this is not what is happening, Thomas.  ;-)
> >
> > You are asserting that all of the cond_resched() calls can safely be
> > eliminated.  That might well be, but more than assertion is required.
> > You have come up with some good ways of getting rid of some classes of
> > them, which is a very good and very welcome thing.  But that is not the
> > same as having proved that all of them may be safely removed.
> 
> Neither have you proven that any of them will be required with the new
> PREEMPT_LAZY model. :)

True.  But nor have you proven them unnecessary.  That will need to
wait for larger-scale testing.

> Your experience and knowledge in this area is certainly appreciated, but
> under the changed semantics of LAZY it's debatable whether observations
> and assumptions which are based on PREEMPT_NONE behaviour still apply.
> 
> We'll see.

That we will!

							Thanx, Paul
Paul E. McKenney Dec. 7, 2023, 4:37 a.m. UTC | #23
On Wed, Dec 06, 2023 at 09:10:22PM -0500, Steven Rostedt wrote:
> On Wed, 06 Dec 2023 17:31:30 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
> > ---
> > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > index 9aa6358a1a16..d8139cda8814 100644
> > --- a/include/linux/preempt.h
> > +++ b/include/linux/preempt.h
> > @@ -226,9 +226,11 @@ do { \
> >  #ifdef CONFIG_PREEMPTION
> >  #define preempt_enable() \
> >  do { \
> >  	barrier(); \
> >  	if (unlikely(preempt_count_dec_and_test())) \
> >  		__preempt_schedule(); \
> > +	else if (unlikely(raw_cpu_read(rcu_data.rcu_urgent_qs))) \
> 
> Shouldn't this still have the:
> 
> 	else if (!IS_ENABLED(CONFIG_PREEMPT_RCU) && \
> 
> That is, is it needed when PREEMPT_RCU is set?

Given that PREEMPT_RCU has been getting along fine without it, I agree
with Steve on this one.  Unless and until someone demonstrates otherwise,
but such a demonstration would almost certainly affect current code,
not just the lazy-preemption changes.

							Thanx, Paul

> -- Steve
> 
> 
> > +		rcu_all_qs_check();
> >  } while (0)
> > 
> >  #define preempt_enable_notrace() \
Thomas Gleixner Dec. 7, 2023, 2:22 p.m. UTC | #24
On Wed, Dec 06 2023 at 17:31, Ankur Arora wrote:
> If we end up needing this, is it valid to just optimistically check if
> a quiescent state needs to be registered (see below)?
> Though this version exposes rcu_data.rcu_urgent_qs outside RCU but maybe
> we can encapsulate that in linux/rcupdate.h.

>  #ifdef CONFIG_PREEMPTION
>  #define preempt_enable() \
>  do { \
>  	barrier(); \
>  	if (unlikely(preempt_count_dec_and_test())) \
>  		__preempt_schedule(); \
> +	else if (unlikely(raw_cpu_read(rcu_data.rcu_urgent_qs))) \
> +		rcu_all_qs_check();

It's still bloat and we can debate this once we come to the conclusion
that the simple forced reschedule is not sufficient. Until then debating
this is just an academic exercise.

Thanks,

        tglx
diff mbox series

Patch

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index f87191e008ff..618f055f8028 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -963,7 +963,8 @@  static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
  */
 static void rcu_flavor_sched_clock_irq(int user)
 {
-	if (user || rcu_is_cpu_rrupt_from_idle()) {
+	if (user || rcu_is_cpu_rrupt_from_idle() ||
+	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
 
 		/*
 		 * Get here if this CPU took its interrupt from user
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bf5df2b866df..15db5fb7acc7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8588,20 +8588,7 @@  int __sched _cond_resched(void)
 		preempt_schedule_common();
 		return 1;
 	}
-	/*
-	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
-	 * whether the current CPU is in an RCU read-side critical section,
-	 * so the tick can report quiescent states even for CPUs looping
-	 * in kernel context.  In contrast, in non-preemptible kernels,
-	 * RCU readers leave no in-memory hints, which means that CPU-bound
-	 * processes executing in kernel context might never report an
-	 * RCU quiescent state.  Therefore, the following code causes
-	 * cond_resched() to report a quiescent state, but only when RCU
-	 * is in urgent need of one.
-	 */
-#ifndef CONFIG_PREEMPT_RCU
-	rcu_all_qs();
-#endif
+
 	return 0;
 }
 EXPORT_SYMBOL(_cond_resched);