
[RFC,47/86] rcu: select PREEMPT_RCU if PREEMPT

Message ID 20231107215742.363031-48-ankur.a.arora@oracle.com (mailing list archive)
State New
Series Make the kernel preemptible

Commit Message

Ankur Arora Nov. 7, 2023, 9:57 p.m. UTC
With PREEMPTION being always-on, some configurations might prefer
the stronger forward-progress guarantees provided by PREEMPT_RCU=n
as compared to PREEMPT_RCU=y.

So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE, and
enable PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.

Note that the preemption model can be changed at runtime (modulo
configurations with ARCH_NO_PREEMPT), but the RCU configuration
is statically compiled.

Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

---
CC-note: Paul had flagged some code that might be impacted
by the proposed RCU changes:

1. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
   unchanged, but I must defer to the include/net/ip_vs.h people.

2. I need to check with the BPF folks on the BPF verifier's
   definition of BTF_ID(func, rcu_read_unlock_strict).

3. I must defer to others on the mm/pgtable-generic.c file's
   #ifdef that depends on CONFIG_PREEMPT_RCU.

Detailed here:
 https://lore.kernel.org/lkml/a375674b-de27-4965-a4bf-e0679229e28e@paulmck-laptop/

---
 include/linux/rcutree.h | 2 +-
 kernel/rcu/Kconfig      | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

Comments

Steven Rostedt Nov. 8, 2023, 12:27 a.m. UTC | #1
On Tue,  7 Nov 2023 13:57:33 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> With PREEMPTION being always-on, some configurations might prefer
> the stronger forward-progress guarantees provided by PREEMPT_RCU=n
> as compared to PREEMPT_RCU=y.
> 
> So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE and
> enabling PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.
> 
> Note that the preemption model can be changed at runtime (modulo
> configurations with ARCH_NO_PREEMPT), but the RCU configuration
> is statically compiled.

I wonder if we should make this a separate patch, and allow PREEMPT_RCU=n
when PREEMPT=y?

This could allow us to test this without this having to be part of this
series.

-- Steve
Julian Anastasov Nov. 8, 2023, 12:15 p.m. UTC | #2
Hello,

On Tue, 7 Nov 2023, Ankur Arora wrote:

> With PREEMPTION being always-on, some configurations might prefer
> the stronger forward-progress guarantees provided by PREEMPT_RCU=n
> as compared to PREEMPT_RCU=y.
> 
> So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE and
> enabling PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.
> 
> Note that the preemption model can be changed at runtime (modulo
> configurations with ARCH_NO_PREEMPT), but the RCU configuration
> is statically compiled.
> 
> Cc: Simon Horman <horms@verge.net.au>
> Cc: Julian Anastasov <ja@ssi.bg>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> 
> ---
> CC-note: Paul had flagged some code that might be impacted
> with the proposed RCU changes:
> 
> 1. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
>    unchanged, but I must defer to the include/net/ip_vs.h people.

	Yes, IPVS_EST_TICK_CHAINS depends on the rcu_read_unlock()
and rcu_read_lock() calls in cond_resched_rcu(), so just removing
the cond_resched() call there is OK for us. Same for the other
cond_resched() calls in ipvs/.

Regards

--
Julian Anastasov <ja@ssi.bg>
Paul E. McKenney Nov. 21, 2023, 12:28 a.m. UTC | #3
On Tue, Nov 07, 2023 at 07:27:03PM -0500, Steven Rostedt wrote:
> On Tue,  7 Nov 2023 13:57:33 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
> > With PREEMPTION being always-on, some configurations might prefer
> > the stronger forward-progress guarantees provided by PREEMPT_RCU=n
> > as compared to PREEMPT_RCU=y.
> > 
> > So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE and
> > enabling PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.
> > 
> > Note that the preemption model can be changed at runtime (modulo
> > configurations with ARCH_NO_PREEMPT), but the RCU configuration
> > is statically compiled.
> 
> I wonder if we should make this a separate patch, and allow PREEMPT_RCU=n
> when PREEMPT=y?

You mean independent of this series?  If so, I am not all that excited
about allowing a new option due to the effect on testing.  With this full
series, the number of test scenarios is preserved.

Actually, that is not exactly true, is it?  It would be if we instead had
something like this:

config PREEMPT_RCU
	bool
	default y if PREEMPT || PREEMPT_RT
	depends on !PREEMPT_NONE && !PREEMPT_VOLUNTARY
	select TREE_RCU

Any reason why this would be a problem?

Or to put it another way, do you know of anyone who really wants
a preemptible kernel (CONFIG_PREEMPT=y, CONFIG_PREEMPT_NONE=n
and CONFIG_PREEMPT_VOLUNTARY=n) but also non-preemptible RCU
(CONFIG_PREEMPT_RCU=y)?  If so, why?  I am having some difficulty seeing
how this combination could be at all helpful.  And if it is not helpful,
we should not allow people to shoot themselves in the foot with it.

> This could allow us to test this without this having to be part of this
> series.

OK, if you mean for testing purposes but not to go to mainline without
the rest of the series, I am good with that idea.

And thank you to Ankur for preserving non-preemptible RCU for those of us
using systems that are adequately but not generously endowed with memory!

							Thanx, Paul
Steven Rostedt Nov. 21, 2023, 3:43 a.m. UTC | #4
On Mon, 20 Nov 2023 16:28:50 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> On Tue, Nov 07, 2023 at 07:27:03PM -0500, Steven Rostedt wrote:
> > On Tue,  7 Nov 2023 13:57:33 -0800
> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >   
> > > With PREEMPTION being always-on, some configurations might prefer
> > > the stronger forward-progress guarantees provided by PREEMPT_RCU=n
> > > as compared to PREEMPT_RCU=y.
> > > 
> > > So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE and
> > > enabling PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.
> > > 
> > > Note that the preemption model can be changed at runtime (modulo
> > > configurations with ARCH_NO_PREEMPT), but the RCU configuration
> > > is statically compiled.  
> > 
> > I wonder if we should make this a separate patch, and allow PREEMPT_RCU=n
> > when PREEMPT=y?  
> 
> You mean independent of this series?  If so, I am not all that excited
> about allowing a new option due to the effect on testing.  With this full
> series, the number of test scenarios is preserved.
> 
> Actually, that is not exactly true, is it?  It would be if we instead had
> something like this:
> 
> config PREEMPT_RCU
> 	bool
> 	default y if PREEMPT || PREEMPT_RT
> 	depends on !PREEMPT_NONE && !PREEMPT_VOLUNTARY
> 	select TREE_RCU
> 
> Any reason why this would be a problem?

Yes, because with this series, there isn't going to be PREEMPT_NONE,
PREEMPT_VOLUNTARY and PREEMPT as a config option. I mean, you could define
the preference you want at boot up. But it could change at run time.

> 
> Or to put it another way, do you know of anyone who really wants
> a preemptible kernel (CONFIG_PREEMPT=y, CONFIG_PREEMPT_NONE=n
> and CONFIG_PREEMPT_VOLUNTARY=n) but also non-preemptible RCU
> (CONFIG_PREEMPT_RCU=y)?  If so, why?  I am having some difficulty seeing
> how this combination could be at all helpful.  And if it is not helpful,
> we should not allow people to shoot themselves in the foot with it.

With the new preemption model, NONE, VOLUNTARY and PREEMPT are now going to
determine when NEED_RESCHED is set as opposed to NEED_RESCHED_LAZY.
NEED_RESCHED_LAZY only schedules at the kernel/user space transition, while
NEED_RESCHED will schedule whenever possible (outside preempt-disabled
sections).

 Key: L - NEED_RESCHED_LAZY - schedule only at kernel/user boundary
      N - NEED_RESCHED - schedule whenever possible (like PREEMPT does today)

			SCHED_OTHER	REAL-TIME/DL
			  Schedule	  Schedule

NONE:			      L		     L

VOLUNTARY:		      L		     N

PREEMPT:		      N		     N


So on NONE, only NEED_RESCHED_LAZY is set, for both SCHED_OTHER and RT
scheduling. Which means it will not schedule until it returns to user space (*).

On VOLUNTARY, NEED_RESCHED is set on RT/DL tasks, and LAZY on SCHED_OTHER.
So that RT and DL get scheduled just like PREEMPT does today.

On PREEMPT, NEED_RESCHED is always set on all scheduling.

(*) - caveat - After the next tick, if NEED_RESCHED_LAZY is set, then
NEED_RESCHED will be set and the kernel will schedule at the next available
moment, this is true for all three models!

There may be more details to work out, but the above is basically the gist
of the idea. Now, what do you want to do with RCU_PREEMPT? At run time, we
can go from NONE to PREEMPT full! But there may be use cases that do not
want the overhead of always having RCU_PREEMPT, and will want RCU to be a
preempt_disable() section no matter what.

Unless we can switch between RCU_PREEMPT and !RCU_PREEMPT at run time,
tying RCU_PREEMPT to PREEMPT doesn't make sense anymore.

> 
> > This could allow us to test this without this having to be part of this
> > series.  
> 
> OK, if you mean for testing purposes but not to go to mainline without
> the rest of the series, I am good with that idea.
> 
> And thank you to Ankur for preserving non-preemptible RCU for those of us
> using system that are adequately but not generously endowed with memory!

Exactly. It sounds like having non-preempt RCU is agnostic to the
preemption model of the system, which is why I think we need to make them
disjoint.

-- Steve
Paul E. McKenney Nov. 21, 2023, 5:04 a.m. UTC | #5
On Mon, Nov 20, 2023 at 10:43:56PM -0500, Steven Rostedt wrote:
> On Mon, 20 Nov 2023 16:28:50 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > On Tue, Nov 07, 2023 at 07:27:03PM -0500, Steven Rostedt wrote:
> > > On Tue,  7 Nov 2023 13:57:33 -0800
> > > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> > >   
> > > > With PREEMPTION being always-on, some configurations might prefer
> > > > the stronger forward-progress guarantees provided by PREEMPT_RCU=n
> > > > as compared to PREEMPT_RCU=y.
> > > > 
> > > > So, select PREEMPT_RCU=n for PREEMPT_VOLUNTARY and PREEMPT_NONE and
> > > > enabling PREEMPT_RCU=y for PREEMPT or PREEMPT_RT.
> > > > 
> > > > Note that the preemption model can be changed at runtime (modulo
> > > > configurations with ARCH_NO_PREEMPT), but the RCU configuration
> > > > is statically compiled.  
> > > 
> > > I wonder if we should make this a separate patch, and allow PREEMPT_RCU=n
> > > when PREEMPT=y?  
> > 
> > You mean independent of this series?  If so, I am not all that excited
> > about allowing a new option due to the effect on testing.  With this full
> > series, the number of test scenarios is preserved.
> > 
> > Actually, that is not exactly true, is it?  It would be if we instead had
> > something like this:
> > 
> > config PREEMPT_RCU
> > 	bool
> > 	default y if PREEMPT || PREEMPT_RT
> > 	depends on !PREEMPT_NONE && !PREEMPT_VOLUNTARY
> > 	select TREE_RCU
> > 
> > Any reason why this would be a problem?
> 
> Yes, because with this series, there isn't going to be PREEMPT_NONE,
> PREEMPT_VOLUNTARY and PREEMPT as a config option. I mean, you could define
> the preference you want at boot up. But it could change at run time.

I applied the series, and there was still a PREEMPT_NONE.  Some might
consider the name to be a bit misleading, perhaps, but it was still there.

Ah, I missed patch 30/86.  The idea is to make CONFIG_PREEMPT_DYNAMIC
unconditional?  Why?  

> > Or to put it another way, do you know of anyone who really wants
> > a preemptible kernel (CONFIG_PREEMPT=y, CONFIG_PREEMPT_NONE=n
> > and CONFIG_PREEMPT_VOLUNTARY=n) but also non-preemptible RCU
> > (CONFIG_PREEMPT_RCU=y)?  If so, why?  I am having some difficulty seeing
> > how this combination could be at all helpful.  And if it is not helpful,
> > we should not allow people to shoot themselves in the foot with it.
> 
> With the new preemption model, NONE, VOLUNTARY and PREEMPT are now going to
> determine when NEED_RESCHED is set as supposed to NEED_RESCHED_LAZY. As
> NEED_RESCHED_LAZY only schedules at kernel / user space transaction, and
> NEED_RESCHED will schedule when possible (non-preempt disable section).

So NONE really is still supposed to be there.  ;-)

>  Key: L - NEED_RESCHED_LAZY - schedule only at kernel/user boundary
>       N - NEED_RESCHED - schedule whenever possible (like PREEMPT does today)
> 
> 			SCHED_OTHER	REAL-TIME/DL
> 			  Schedule	  Schedule
> 
> NONE:			      L		     L
> 
> VOLUNTARY:		      L		     N
> 
> PREEMPT:		      N		     N
> 
> 
> So on NONE, NEED_RESCHED_LAZY is set only on scheduling SCHED_OTHER and RT.
> Which means, it will not schedule until it goes into user space (*).
> 
> On VOLUNTARY, NEED_RESCHED is set on RT/DL tasks, and LAZY on SCHED_OTHER.
> So that RT and DL get scheduled just like PREEMPT does today.
> 
> On PREEMPT, NEED_RESCHED is always set on all scheduling.
> 
> (*) - caveat - After the next tick, if NEED_RESCHED_LAZY is set, then
> NEED_RESCHED will be set and the kernel will schedule at the next available
> moment, this is true for all three models!

OK, so I see that this is now a SCHED_FEAT, and is initialized based
on CONFIG_PREEMPT_* in kernel/sched/features.h.  Huh.  OK, we can still
control this at build time, which is fine.  I don't see how to set it
at boot time, only at build time or from debugfs.  I will let those who
want to set this at boot time complain, should they choose to do so.

> There may be more details to work out, but the above is basically the gist
> of the idea. Now, what do you want to do with RCU_PREEMPT? At run time, we
> can go from NONE to PREEMPT full! But there may be use cases that do not
> want the overhead of always having RCU_PREEMPT, and will want RCU to be a
> preempt_disable() section no matter what.

Understood, actually.  And as noted in other replies, I am a bit concerned
about added latencies from too aggressively removing cond_resched().

More testing required.

> Unless we can switch between RCU_PREEMPT and !RCU_PREEMPT at run time, the
> dependency on RCU_PREEMPT tied to PREEMPT doesn't make sense anymore.

I strongly recommend against runtime switching of RCU's preemptibility,
just in case you were wondering.  ;-)

My question is different.

Would anyone want PREEMPT (N N above) in combination with non-preemptible
RCU?  I cannot see why anyone would want this.

> > > This could allow us to test this without this having to be part of this
> > > series.  
> > 
> > OK, if you mean for testing purposes but not to go to mainline without
> > the rest of the series, I am good with that idea.
> > 
> > And thank you to Ankur for preserving non-preemptible RCU for those of us
> > using system that are adequately but not generously endowed with memory!
> 
> Exactly. It sounds like having non-preempt RCU is agnostic to the
> preemption model of the system, which is why I think we need to make them
> disjoint.

How about like this, where "Y" means allowed and "N" means not allowed:

			Non-Preemptible RCU	Preemptible RCU

NONE:				Y			Y

VOLUNTARY:			Y			Y

PREEMPT:			N			Y

PREEMPT_RT:			N			Y


We need preemptible RCU for NONE and VOLUNTARY, as you say,
to allow CONFIG_PREEMPT_DYNAMIC to continue to work.  (OK, OK,
CONFIG_PREEMPT_DYNAMIC is no longer, but appears to be unconditional.)
But again, I don't see why anyone would want (much less need)
non-preemptible RCU in the PREEMPT and PREEMPT_RT cases.  And if it is
neither wanted nor needed, there is no point in enabling it, much less
testing it.

Or am I missing a use case in there somewhere?

							Thanx, Paul
Ankur Arora Nov. 21, 2023, 5:39 a.m. UTC | #6
Paul E. McKenney <paulmck@kernel.org> writes:

> On Mon, Nov 20, 2023 at 10:43:56PM -0500, Steven Rostedt wrote:
>> On Mon, 20 Nov 2023 16:28:50 -0800
>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>>
>> > On Tue, Nov 07, 2023 at 07:27:03PM -0500, Steven Rostedt wrote:
>> > > On Tue,  7 Nov 2023 13:57:33 -0800
>> > > Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> > >
>> With the new preemption model, NONE, VOLUNTARY and PREEMPT are now going to
>> determine when NEED_RESCHED is set as supposed to NEED_RESCHED_LAZY. As
>> NEED_RESCHED_LAZY only schedules at kernel / user space transaction, and
>> NEED_RESCHED will schedule when possible (non-preempt disable section).
>
> So NONE really is still supposed to be there.  ;-)
>
>>  Key: L - NEED_RESCHED_LAZY - schedule only at kernel/user boundary
>>       N - NEED_RESCHED - schedule whenever possible (like PREEMPT does today)
>>
>> 			SCHED_OTHER	REAL-TIME/DL
>> 			  Schedule	  Schedule
>>
>> NONE:		      L		     L
>>
>> VOLUNTARY:		      L		     N
>>
>> PREEMPT:		      N		     N
>>
>>
>> So on NONE, NEED_RESCHED_LAZY is set only on scheduling SCHED_OTHER and RT.
>> Which means, it will not schedule until it goes into user space (*).
>>
>> On VOLUNTARY, NEED_RESCHED is set on RT/DL tasks, and LAZY on SCHED_OTHER.
>> So that RT and DL get scheduled just like PREEMPT does today.
>>
>> On PREEMPT, NEED_RESCHED is always set on all scheduling.
>>
>> (*) - caveat - After the next tick, if NEED_RESCHED_LAZY is set, then
>> NEED_RESCHED will be set and the kernel will schedule at the next available
>> moment, this is true for all three models!
>
> OK, so I see that this is now a SCHED_FEAT, and is initialized based
> on CONFIG_PREEMPT_* in kernel/sched/feature.h.  Huh.  OK, we can still
> control this at build time, which is fine.  I don't see how to set it
> at boot time, only at build time or from debugfs.  I will let those who
> want to set this at boot time complain, should they choose to do so.

Needless to say, these patches were just an RFC, and so they both
overreach and also miss some things.

v1 adds a new preemption model which coexists with the current
CONFIG_PREEMPT_*, is broadly based on Thomas' PoC, and is along the
lines we have been discussing.

>> There may be more details to work out, but the above is basically the gist
>> of the idea. Now, what do you want to do with RCU_PREEMPT? At run time, we
>> can go from NONE to PREEMPT full! But there may be use cases that do not
>> want the overhead of always having RCU_PREEMPT, and will want RCU to be a
>> preempt_disable() section no matter what.
>
> Understood, actually.  And as noted in other replies, I am a bit concerned
> about added latencies from too aggressively removing cond_resched().
>
> More testing required.

Yes, agreed.

--
ankur
Steven Rostedt Nov. 21, 2023, 3 p.m. UTC | #7
On Mon, 20 Nov 2023 21:04:28 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> How about like this, where "Y" means allowed and "N" means not allowed:
> 
> 			Non-Preemptible RCU	Preemptible RCU
> 
> NONE:				Y			Y
> 
> VOLUNTARY:			Y			Y
> 
> PREEMPT:			N			Y
> 
> PREEMPT_RT:			N			Y
> 
> 
> We need preemptible RCU for NONE and VOLUNTARY, as you say,
> to allow CONFIG_PREEMPT_DYNAMIC to continue to work.  (OK, OK,
> CONFIG_PREEMPT_DYNAMIC is no longer, but appears to be unconditional.)
> But again, I don't see why anyone would want (much less need)
> non-preemptible RCU in the PREEMPT and PREEMPT_RT cases.  And if it is
> neither wanted nor needed, there is no point in enabling it, much less
> testing it.
> 
> Or am I missing a use case in there somewhere?

As Ankur replied, this is just an RFC, not the main goal. I'm talking about
the end product, which will get rid of the PREEMPT_NONE, PREEMPT_VOLUNTARY
and PREEMPT configs, and there will *only* be the PREEMPT_DYNAMIC and
PREEMPT_RT.

And yes, this is going to be a slow and long process, to find and fix all
regressions. I too am concerned about the latency that this may add. I'm
thinking we could have NEED_RESCHED_LAZY preempt when there is no mutex or
other semi-critical section held (like migrate_disable()).

Right now, the use of cond_resched() is basically a whack-a-mole game where
we need to whack all the mole loops with the cond_resched() hammer. As
Thomas said, this is backwards. It makes more sense to just not preempt in
areas that can cause pain (like holding a mutex or in an RCU critical
section), but still have the general kernel be fully preemptable.

-- Steve
Paul E. McKenney Nov. 21, 2023, 3:19 p.m. UTC | #8
On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
> On Mon, 20 Nov 2023 21:04:28 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > How about like this, where "Y" means allowed and "N" means not allowed:
> > 
> > 			Non-Preemptible RCU	Preemptible RCU
> > 
> > NONE:				Y			Y
> > 
> > VOLUNTARY:			Y			Y
> > 
> > PREEMPT:			N			Y
> > 
> > PREEMPT_RT:			N			Y
> > 
> > 
> > We need preemptible RCU for NONE and VOLUNTARY, as you say,
> > to allow CONFIG_PREEMPT_DYNAMIC to continue to work.  (OK, OK,
> > CONFIG_PREEMPT_DYNAMIC is no longer, but appears to be unconditional.)
> > But again, I don't see why anyone would want (much less need)
> > non-preemptible RCU in the PREEMPT and PREEMPT_RT cases.  And if it is
> > neither wanted nor needed, there is no point in enabling it, much less
> > testing it.
> > 
> > Or am I missing a use case in there somewhere?
> 
> As Ankur replied, this is just an RFC, not the main goal. I'm talking about
> the end product which will get rid of the PREEMPT_NONE, PREEMPT_VOLUNTARY
> and PREEMPT conifgs, and there will *only* be the PREEMPT_DYNAMIC and
> PREEMPT_RT.
> 
> And yes, this is going to be a slow and long processes, to find and fix all
> regressions. I too am concerned about the latency that this may add. I'm
> thinking we could have NEED_RESCHED_LAZY preempt when there is no mutex or
> other semi critical section held (like migrate_disable()).

Indeed.  For one thing, you have a lot of work to do to demonstrate
that this would actually be a good thing.  For example, what is so
horribly bad about selecting minimal preemption (NONE and/or VOLUNTARY)
at build time???

> Right now, the use of cond_resched() is basically a whack-a-mole game where
> we need to whack all the mole loops with the cond_resched() hammer. As
> Thomas said, this is backwards. It makes more sense to just not preempt in
> areas that can cause pain (like holding a mutex or in an RCU critical
> section), but still have the general kernel be fully preemptable.

Which is quite true, but that whack-a-mole game can be ended without
getting rid of build-time selection of the preemption model.  Also,
that whack-a-mole game can be ended without eliminating all calls to
cond_resched().

Additionally, if the end goal is to be fully preemptible as in eventually
eliminating lazy preemption, you have a lot more convincing to do.
For but one example, consider the high cost of the additional context
switches that this would visit on a number of performance-sensitive workloads.

So what exactly are you guys trying to accomplish here?  ;-)

							Thanx, Paul
Thomas Gleixner Nov. 28, 2023, 10:53 a.m. UTC | #9
Paul!

On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
> On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
>> Right now, the use of cond_resched() is basically a whack-a-mole game where
>> we need to whack all the mole loops with the cond_resched() hammer. As
>> Thomas said, this is backwards. It makes more sense to just not preempt in
>> areas that can cause pain (like holding a mutex or in an RCU critical
>> section), but still have the general kernel be fully preemptable.
>
> Which is quite true, but that whack-a-mole game can be ended without
> getting rid of build-time selection of the preemption model.  Also,
> that whack-a-mole game can be ended without eliminating all calls to
> cond_resched().

Which calls to cond_resched() should not be eliminated?

They all suck and keeping some of them is just counterproductive as
again people will sprinkle them all over the place for the very wrong
reasons.

> Additionally, if the end goal is to be fully preemptible as in eventually
> eliminating lazy preemption, you have a lot more convincing to do.

That's absolutely not the case. Even RT uses the lazy mode to prevent
overeager preemption for non RT tasks.

The whole point of the exercise is to keep the kernel always fully
preemptible, but only enforce the immediate preemption at the next
possible preemption point when necessary.

The decision when it is necessary is made by the scheduler and not
delegated to the whim of cond/might_resched() placement.

That is serving both worlds best IMO:

  1) LAZY preemption prevents the negative side effects of overeager
     preemption, aka. lock contention and pointless context switching.

     The whole thing behaves like a NONE kernel unless there are
     real-time tasks or a task did not comply to the lazy request within
     a given time.

  2) It does not prevent the scheduler from making decisions to preempt
     at the next possible preemption point in order to get some
     important computation on the CPU.

     A NONE kernel sucks vs. any sporadic [real-time] task. Just run
     NONE and watch the latencies. The latencies are determined by the
     interrupted context, the placement of the cond_resched() call and
     the length of the loop which is running.

     People have complained about that and the only way out for them is
     to switch to VOLUNTARY or FULL preemption and thereby paying the
     price for overeager preemption.

     A price which you don't want to pay for good reasons but at the
     same time you care about latencies in some aspects and the only
     answer you have for that is cond_resched() or similar which is not
     an answer at all.

  3) Looking at the initial problem Ankur was trying to solve, there is
     absolutely no acceptable solution unless you think that the
     semantically inverse 'allow_preempt()/disallow_preempt()' is
     anywhere near acceptable.

Thanks,

        tglx
Ankur Arora Nov. 28, 2023, 6:30 p.m. UTC | #10
Thomas Gleixner <tglx@linutronix.de> writes:

> Paul!
>
> On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
>> On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
>>> Right now, the use of cond_resched() is basically a whack-a-mole game where
>>> we need to whack all the mole loops with the cond_resched() hammer. As
>>> Thomas said, this is backwards. It makes more sense to just not preempt in
>>> areas that can cause pain (like holding a mutex or in an RCU critical
>>> section), but still have the general kernel be fully preemptable.
>>
>> Which is quite true, but that whack-a-mole game can be ended without
>> getting rid of build-time selection of the preemption model.  Also,
>> that whack-a-mole game can be ended without eliminating all calls to
>> cond_resched().
>
> Which calls to cond_resched() should not be eliminated?
>
> They all suck and keeping some of them is just counterproductive as
> again people will sprinkle them all over the place for the very wrong
> reasons.

And, as Thomas alludes to here, cond_resched() is not always cost-free.
Needing to call cond_resched() forces us to restructure hot paths in
ways that result in worse performance and more complex code.

One example is clear_huge_page(), where removing the need to call
cond_resched() every once in a while allows the processor to optimize
differently.

  *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
                          (GB/s)                (GB/s)

  pg-sz=2MB                14.55                 19.29    +32.5%
  pg-sz=1GB                19.34                 49.60   +156.4%

(See https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/)

And, that's one of the simpler examples from mm. We do this kind of arbitrary
batching all over the place.

Or see the filemap_read() example that Linus gives here:
 https://lore.kernel.org/lkml/CAHk-=whpYjm_AizQij6XEfTd7xvGjrVCx5gzHcHm=2Xijt+Kyg@mail.gmail.com/#t

Thanks
--
ankur
Paul E. McKenney Dec. 5, 2023, 1:01 a.m. UTC | #11
On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
> Paul!
> 
> On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
> > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
> >> Right now, the use of cond_resched() is basically a whack-a-mole game where
> >> we need to whack all the mole loops with the cond_resched() hammer. As
> >> Thomas said, this is backwards. It makes more sense to just not preempt in
> >> areas that can cause pain (like holding a mutex or in an RCU critical
> >> section), but still have the general kernel be fully preemptable.
> >
> > Which is quite true, but that whack-a-mole game can be ended without
> > getting rid of build-time selection of the preemption model.  Also,
> > that whack-a-mole game can be ended without eliminating all calls to
> > cond_resched().
> 
> Which calls to cond_resched() should not be eliminated?

The ones which, if eliminated, will result in excessive latencies.

This question is going to take some time to answer.  One type of potential
issue is where the cond_resched() precedes something like mutex_lock(),
where that mutex_lock() takes the fast path and preemption follows
shortly thereafter.  It would clearly have been better to have preempted
before acquisition.

Another involves the aforementioned situations where removing the cond_resched()
increases latency.  Yes, capping the preemption latency is a wonderful
thing, and the people I chatted with are all for that, but it is only
natural that there would be a corresponding level of concern about the
cases where removing the cond_resched() calls increases latency.

There might be others as well.  These are the possibilities that have
come up thus far.

> They all suck and keeping some of them is just counterproductive as
> again people will sprinkle them all over the place for the very wrong
> reasons.

Yes, but do they suck enough and are they counterproductive enough to
be useful and necessary?  ;-)

> > Additionally, if the end goal is to be fully preemptible as in eventually
> > eliminating lazy preemption, you have a lot more convincing to do.
> 
> That's absolutely not the case. Even RT uses the lazy mode to prevent
> overeager preemption for non RT tasks.

OK, that is very good to hear.

> The whole point of the exercise is to keep the kernel always fully
> preemptible, but only enforce the immediate preemption at the next
> possible preemption point when necessary.
> 
> The decision when it is necessary is made by the scheduler and not
> delegated to the whim of cond/might_resched() placement.

I am not arguing that the developer placing a given cond_resched()
always knows best, but you have some work to do to convince me that the
scheduler always knows best.

> That is serving both worlds best IMO:
> 
>   1) LAZY preemption prevents the negative side effects of overeager
>      preemption, aka. lock contention and pointless context switching.
> 
>      The whole thing behaves like a NONE kernel unless there are
>      real-time tasks or a task did not comply to the lazy request within
>      a given time.

Almost, give or take the potential issues called out above for the
possible downsides of removing all of the cond_resched() invocations.

>   2) It does not prevent the scheduler from making decisions to preempt
>      at the next possible preemption point in order to get some
>      important computation on the CPU.
> 
>      A NONE kernel sucks vs. any sporadic [real-time] task. Just run
>      NONE and watch the latencies. The latencies are determined by the
>      interrupted context, the placement of the cond_resched() call and
>      the length of the loop which is running.
> 
>      People have complained about that and the only way out for them is
>      to switch to VOLUNTARY or FULL preemption and thereby paying the
>      price for overeager preemption.
> 
>      A price which you don't want to pay for good reasons but at the
>      same time you care about latencies in some aspects and the only
>      answer you have for that is cond_resched() or similar which is not
>      an answer at all.

All good points, but none of them are in conflict with the possibility
of leaving some cond_resched() calls behind if they are needed.

>   3) Looking at the initial problem Ankur was trying to solve there is
>      absolutely no acceptable solution to solve that unless you think
>      that the semantically inverse 'allow_preempt()/disallow_preempt()'
>      is anywhere near acceptable.

I am not arguing for allow_preempt()/disallow_preempt(), so for that
argument, you need to find someone else to argue with.  ;-)

							Thanx, Paul
Paul E. McKenney Dec. 5, 2023, 1:03 a.m. UTC | #12
On Tue, Nov 28, 2023 at 10:30:53AM -0800, Ankur Arora wrote:
> 
> Thomas Gleixner <tglx@linutronix.de> writes:
> 
> > Paul!
> >
> > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
> >> On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
> >>> Right now, the use of cond_resched() is basically a whack-a-mole game where
> >>> we need to whack all the mole loops with the cond_resched() hammer. As
> >>> Thomas said, this is backwards. It makes more sense to just not preempt in
> >>> areas that can cause pain (like holding a mutex or in an RCU critical
> >>> section), but still have the general kernel be fully preemptable.
> >>
> >> Which is quite true, but that whack-a-mole game can be ended without
> >> getting rid of build-time selection of the preemption model.  Also,
> >> that whack-a-mole game can be ended without eliminating all calls to
> >> cond_resched().
> >
> > Which calls to cond_resched() should not be eliminated?
> >
> > They all suck and keeping some of them is just counterproductive as
> > again people will sprinkle them all over the place for the very wrong
> > reasons.
> 
> And, as Thomas alludes to here, cond_resched() is not always cost free.
> Needing to call cond_resched() forces us to restructure hot paths in
> ways that result in worse performance or more complex code.
> 
> One example is clear_huge_page(), where removing the need to call
> cond_resched() every once in a while allows the processor to optimize
> differently.
> 
>   *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
>                           (GB/s)                (GB/s)
> 
>   pg-sz=2MB                14.55                 19.29    +32.5%
>   pg-sz=1GB                19.34                 49.60   +156.4%
> 
> (See https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/)
> 
> And, that's one of the simpler examples from mm. We do this kind of arbitrary
> batching all over the place.
> 
> Or see the filemap_read() example that Linus gives here:
>  https://lore.kernel.org/lkml/CAHk-=whpYjm_AizQij6XEfTd7xvGjrVCx5gzHcHm=2Xijt+Kyg@mail.gmail.com/#t

I already agree that some cond_resched() calls can cause difficulties.
But that is not the same as proving that they *all* should be removed.

							Thanx, Paul
Steven Rostedt Dec. 5, 2023, 3:01 p.m. UTC | #13
On Mon, 4 Dec 2023 17:01:21 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
> > Paul!
> > 
> > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:  
> > > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:  
> > >> Right now, the use of cond_resched() is basically a whack-a-mole game where
> > >> we need to whack all the mole loops with the cond_resched() hammer. As
> > >> Thomas said, this is backwards. It makes more sense to just not preempt in
> > >> areas that can cause pain (like holding a mutex or in an RCU critical
> > >> section), but still have the general kernel be fully preemptable.  
> > >
> > > Which is quite true, but that whack-a-mole game can be ended without
> > > getting rid of build-time selection of the preemption model.  Also,
> > > that whack-a-mole game can be ended without eliminating all calls to
> > > cond_resched().  
> > 
> > Which calls to cond_resched() should not be eliminated?  
> 
> The ones which, if eliminated, will result in excessive latencies.
> 
> This question is going to take some time to answer.  One type of potential
> issue is where the cond_resched() precedes something like mutex_lock(),
> where that mutex_lock() takes the fast path and preemption follows
> shortly thereafter.  It would clearly have been better to have preempted
> before acquisition.

Note that the new preemption model is a new paradigm and we need to start
thinking a bit differently if we go to it.

One thing I would like to look into with the new work is to have holding a
mutex ignore the NEED_RESCHED_LAZY (similar to what is done with spinlock
converted to mutex in the RT kernel). That way you are less likely to be
preempted while holding a mutex.

> 
> Another is the aforementioned situations where removing the cond_resched()
> increases latency.  Yes, capping the preemption latency is a wonderful
> thing, and the people I chatted with are all for that, but it is only
> natural that there would be a corresponding level of concern about the
> cases where removing the cond_resched() calls increases latency.

With the "capped preemption" I'm not sure that would still be the case.
cond_resched() currently only preempts if NEED_RESCHED is set. That means
the system had to already be in a situation that a schedule needs to
happen. There's lots of places in the kernel that run for over a tick
without any cond_resched(). The cond_resched() is usually added for
locations that show tremendous latency (where either a watchdog triggered,
or showed up in some analysis that had a latency that was much greater than
a tick).

The point is, if/when we switch to the new preemption model, we would need
to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
done to prevent regressions. But the reasons I see cond_resched() being
added today, should no longer exist with this new model.

> 
> There might be others as well.  These are the possibilities that have
> come up thus far.
> 
> > They all suck and keeping some of them is just counterproductive as
> > again people will sprinkle them all over the place for the very wrong
> > reasons.  
> 
> Yes, but do they suck enough and are they counterproductive enough to
> be useful and necessary?  ;-)

They are only useful and necessary because of the way we handle preemption
today. With the new preemption model, they are all likely to be useless and
unnecessary ;-)

> 
> > > Additionally, if the end goal is to be fully preemptible as in
> > > eventually eliminating lazy preemption, you have a lot more
> > > convincing to do.  
> > 
> > That's absolutely not the case. Even RT uses the lazy mode to prevent
> > overeager preemption for non RT tasks.  
> 
> OK, that is very good to hear.

But the paradigm is changing. The kernel will be fully preemptible, it just
won't be preempting often. That is, if the CPU is running kernel code for
too long, and the scheduler tick wants a reschedule, the kernel has one
more tick to get back to user space before it will become fully
preemptible. That is, we force a "cond_resched()".

> 
> > The whole point of the exercise is to keep the kernel always fully
> > preemptible, but only enforce the immediate preemption at the next
> > possible preemption point when necessary.
> > 
> > The decision when it is necessary is made by the scheduler and not
> > delegated to the whim of cond/might_resched() placement.  
> 
> I am not arguing that the developer placing a given cond_resched()
> always knows best, but you have some work to do to convince me that the
> scheduler always knows best.

The cond_resched() already expects the scheduler to know best. It doesn't
resched unless NEED_RESCHED is set and that's determined by the scheduler.
If the code knows best, then it should just call schedule() and be done
with it.

> 
> > That is serving both worlds best IMO:
> > 
> >   1) LAZY preemption prevents the negative side effects of overeager
> >      preemption, aka. lock contention and pointless context switching.
> > 
> >      The whole thing behaves like a NONE kernel unless there are
> >      real-time tasks or a task did not comply with the lazy request within
> >      a given time.  
> 
> Almost, give or take the potential issues called out above for the
> possible downsides of removing all of the cond_resched() invocations.

I still don't believe there are any issues "called out above", as I called
out those called outs.

> 
> >   2) It does not prevent the scheduler from making decisions to preempt
> >      at the next possible preemption point in order to get some
> >      important computation on the CPU.
> > 
> >      A NONE kernel sucks vs. any sporadic [real-time] task. Just run
> >      NONE and watch the latencies. The latencies are determined by the
> >      interrupted context, the placement of the cond_resched() call and
> >      the length of the loop which is running.
> > 
> >      People have complained about that and the only way out for them is
> >      to switch to VOLUNTARY or FULL preemption and thereby paying the
> >      price for overeager preemption.
> > 
> >      A price which you don't want to pay for good reasons but at the
> >      same time you care about latencies in some aspects and the only
> >      answer you have for that is cond_resched() or similar which is not
> >      an answer at all.  
> 
> All good points, but none of them are in conflict with the possibility
> > of leaving some cond_resched() calls behind if they are needed.

The conflict is with the new paradigm (I love that word! It's so "buzzy").
As I mentioned above, cond_resched() is usually added when a problem was
seen. I really believe that those problems would never have been seen if
the new paradigm had already been in place.

> 
> >   3) Looking at the initial problem Ankur was trying to solve there is
> >      absolutely no acceptable solution to solve that unless you think
> >      that the semantically inverse 'allow_preempt()/disallow_preempt()'
> >      is anywhere near acceptable.  
> 
> I am not arguing for allow_preempt()/disallow_preempt(), so for that
> argument, you need to find someone else to argue with.  ;-)

Anyway, there's still a long path before cond_resched() can be removed. It
was a mistake by Ankur to add those removals this early (and he has
acknowledged that mistake).

First we need to get the new preemption model implemented. When it is, it
can be just a config option at first. Then when that config option is set,
you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
between them at run time as they are just a way to tell the scheduler when
to set NEED_RESCHED_LAZY vs NEED_RESCHED.

At that moment, when that config is set, the cond_resched() can turn into a
nop. This will allow for testing to make sure there are no regressions in
latency, even with the NONE mode enabled.

The real test is implementing the code and seeing how it affects things in
the real world. Us arguing about it isn't going to get anywhere. I just
don't want blind NACK. A NACK to a removal of a cond_resched() needs to
show that there was a real regression with that removal.

-- Steve
Paul E. McKenney Dec. 5, 2023, 7:38 p.m. UTC | #14
On Tue, Dec 05, 2023 at 10:01:14AM -0500, Steven Rostedt wrote:
> On Mon, 4 Dec 2023 17:01:21 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
> > > Paul!
> > > 
> > > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:  
> > > > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:  
> > > >> Right now, the use of cond_resched() is basically a whack-a-mole game where
> > > >> we need to whack all the mole loops with the cond_resched() hammer. As
> > > >> Thomas said, this is backwards. It makes more sense to just not preempt in
> > > >> areas that can cause pain (like holding a mutex or in an RCU critical
> > > >> section), but still have the general kernel be fully preemptable.  
> > > >
> > > > Which is quite true, but that whack-a-mole game can be ended without
> > > > getting rid of build-time selection of the preemption model.  Also,
> > > > that whack-a-mole game can be ended without eliminating all calls to
> > > > cond_resched().  
> > > 
> > > Which calls to cond_resched() should not be eliminated?  
> > 
> > The ones which, if eliminated, will result in excessive latencies.
> > 
> > This question is going to take some time to answer.  One type of potential
> > issue is where the cond_resched() precedes something like mutex_lock(),
> > where that mutex_lock() takes the fast path and preemption follows
> > shortly thereafter.  It would clearly have been better to have preempted
> > before acquisition.
> 
> Note that the new preemption model is a new paradigm and we need to start
> thinking a bit differently if we go to it.

We can of course think differently, but existing hardware and software
will probably be a bit more stubborn.

> One thing I would like to look into with the new work is to have holding a
> mutex ignore the NEED_RESCHED_LAZY (similar to what is done with spinlock
> converted to mutex in the RT kernel). That way you are less likely to be
> preempted while holding a mutex.

I like the concept, but those with mutex_lock() of rarely-held mutexes
in their fastpaths might have workloads that have a contrary opinion.

> > Another is the aforementioned situations where removing the cond_resched()
> > increases latency.  Yes, capping the preemption latency is a wonderful
> > thing, and the people I chatted with are all for that, but it is only
> > natural that there would be a corresponding level of concern about the
> > cases where removing the cond_resched() calls increases latency.
> 
> With the "capped preemption" I'm not sure that would still be the case.
> cond_resched() currently only preempts if NEED_RESCHED is set. That means
> the system had to already be in a situation that a schedule needs to
> happen. There's lots of places in the kernel that run for over a tick
> without any cond_resched(). The cond_resched() is usually added for
> locations that show tremendous latency (where either a watchdog triggered,
> or showed up in some analysis that had a latency that was much greater than
> a tick).

For non-real-time workloads, the average case is important, not just the
worst case.  In the new lazily preemptible mode of thought, a preemption
by a non-real-time task will wait a tick.  Earlier, it would have waited
for the next cond_resched().  Which, in the average case, might have
arrived much sooner than one tick.

> The point is, if/when we switch to the new preemption model, we would need
> to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> done to prevent regressions. But the reasons I see cond_resched() being
> added today, should no longer exist with this new model.

This I agree with.  Also, with the new paradigm and new mode of thought
in place, it should be safe to drop any cond_resched() that is in a loop
that consumes more than a tick of CPU time per iteration.

> > There might be others as well.  These are the possibilities that have
> > come up thus far.
> > 
> > > They all suck and keeping some of them is just counterproductive as
> > > again people will sprinkle them all over the place for the very wrong
> > > reasons.  
> > 
> > Yes, but do they suck enough and are they counterproductive enough to
> > be useful and necessary?  ;-)
> 
> They are only useful and necessary because of the way we handle preemption
> today. With the new preemption model, they are all likely to be useless and
> unnecessary ;-)

The "all likely" needs some demonstration.  I agree that a great many
of them would be useless and unnecessary.  Maybe even the vast majority.
But that is different than "all".  ;-)

> > > > Additionally, if the end goal is to be fully preemptible as in
> > > > eventually eliminating lazy preemption, you have a lot more
> > > > convincing to do.  
> > > 
> > > That's absolutely not the case. Even RT uses the lazy mode to prevent
> > > overeager preemption for non RT tasks.  
> > 
> > OK, that is very good to hear.
> 
> But the paradigm is changing. The kernel will be fully preemptible, it just
> won't be preempting often. That is, if the CPU is running kernel code for
> too long, and the scheduler tick wants a reschedule, the kernel has one
> more tick to get back to user space before it will become fully
> preemptible. That is, we force a "cond_resched()".

And as stated quite a few times previously in this and earlier threads,
yes, removing the need to drop cond_resched() into longer-than-average
loops is a very good thing.

> > > The whole point of the exercise is to keep the kernel always fully
> > > preemptible, but only enforce the immediate preemption at the next
> > > possible preemption point when necessary.
> > > 
> > > The decision when it is necessary is made by the scheduler and not
> > > delegated to the whim of cond/might_resched() placement.  
> > 
> > I am not arguing that the developer placing a given cond_resched()
> > always knows best, but you have some work to do to convince me that the
> > scheduler always knows best.
> 
> The cond_resched() already expects the scheduler to know best. It doesn't
> resched unless NEED_RESCHED is set and that's determined by the scheduler.
> If the code knows best, then it should just call schedule() and be done
> with it.

A distinction without a difference.  After all, if the scheduler really
knew best, it would be able to intuit the cond_resched() without that
cond_resched() actually being there.  Which is arguably the whole point
of this patch series, aside from mutexes, the possibility of extending
what are now short preemption times, and who knows what all else.

> > > That is serving both worlds best IMO:
> > > 
> > >   1) LAZY preemption prevents the negative side effects of overeager
> > >      preemption, aka. lock contention and pointless context switching.
> > > 
> > >      The whole thing behaves like a NONE kernel unless there are
> > >      real-time tasks or a task did not comply with the lazy request within
> > >      a given time.  
> > 
> > Almost, give or take the potential issues called out above for the
> > possible downsides of removing all of the cond_resched() invocations.
> 
> I still don't believe there are any issues "called out above", as I called
> out those called outs.

Well, you did write some words, if that is what you meant.  ;-)

> > >   2) It does not prevent the scheduler from making decisions to preempt
> > >      at the next possible preemption point in order to get some
> > >      important computation on the CPU.
> > > 
> > >      A NONE kernel sucks vs. any sporadic [real-time] task. Just run
> > >      NONE and watch the latencies. The latencies are determined by the
> > >      interrupted context, the placement of the cond_resched() call and
> > >      the length of the loop which is running.
> > > 
> > >      People have complained about that and the only way out for them is
> > >      to switch to VOLUNTARY or FULL preemption and thereby paying the
> > >      price for overeager preemption.
> > > 
> > >      A price which you don't want to pay for good reasons but at the
> > >      same time you care about latencies in some aspects and the only
> > >      answer you have for that is cond_resched() or similar which is not
> > >      an answer at all.  
> > 
> > All good points, but none of them are in conflict with the possibility
> > of leaving some cond_resched() calls behind if they are needed.
> 
> The conflict is with the new paradigm (I love that word! It's so "buzzy").
> As I mentioned above, cond_resched() is usually added when a problem was
> > seen. I really believe that those problems would never have been seen if
> the new paradigm had already been in place.

Indeed, that sort of wording does quite the opposite of raising my
confidence levels.  ;-)

You know, the ancient Romans would have had no problem dealing with the
dot-com boom, cryptocurrency, some of the shadier areas of artificial
intelligence and machine learning, and who knows what all else.  As the
Romans used to say, "Beware of geeks bearing grifts."

> > >   3) Looking at the initial problem Ankur was trying to solve there is
> > >      absolutely no acceptable solution to solve that unless you think
> > >      that the semantically inverse 'allow_preempt()/disallow_preempt()'
> > >      is anywhere near acceptable.  
> > 
> > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > argument, you need to find someone else to argue with.  ;-)
> 
> Anyway, there's still a long path before cond_resched() can be removed. It
> was a mistake by Ankur to add those removals this early (and he has
> acknowledged that mistake).

OK, that I can live with.  But that seems to be a bit different of a
take than that of some earlier emails in this thread.  ;-)

> First we need to get the new preemption model implemented. When it is, it
> can be just a config option at first. Then when that config option is set,
> you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> between them at run time as they are just a way to tell the scheduler when
> to set NEED_RESCHED_LAZY vs NEED_RESCHED.

Assuming CONFIG_PREEMPT_RCU=y, agreed.  With CONFIG_PREEMPT_RCU=n,
the runtime switching needs to be limited to NONE and VOLUNTARY.
Which is fine.

> At that moment, when that config is set, the cond_resched() can turn into a
> nop. This will allow for testing to make sure there are no regressions in
> latency, even with the NONE mode enabled.

And once it appears to be reasonably stable (in concept as well as
implementation), heavy testing should get underway.

> The real test is implementing the code and seeing how it affects things in
> the real world. Us arguing about it isn't going to get anywhere.

Indeed, the opinion of the objective universe always wins.  It all too
often takes longer than necessary for the people arguing with each other
to realize this, but such is life.

>                                                                  I just
> don't want blind NACK. A NACK to a removal of a cond_resched() needs to
> show that there was a real regression with that removal.

Fair enough, although a single commit bulk removing a large number of
cond_resched() calls will likely get a bulk NAK.

							Thanx, Paul
Ankur Arora Dec. 5, 2023, 8:18 p.m. UTC | #15
Paul E. McKenney <paulmck@kernel.org> writes:

> On Tue, Dec 05, 2023 at 10:01:14AM -0500, Steven Rostedt wrote:
>> On Mon, 4 Dec 2023 17:01:21 -0800
>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>>
>> > On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
>> > > Paul!
>> > >
>> > > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
>> > > > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
...
>> > >   3) Looking at the initial problem Ankur was trying to solve there is
>> > >      absolutely no acceptable solution to solve that unless you think
>> > >      that the semantically inverse 'allow_preempt()/disallow_preempt()'
>> > >      is anywhere near acceptable.
>> >
>> > I am not arguing for allow_preempt()/disallow_preempt(), so for that
>> > argument, you need to find someone else to argue with.  ;-)
>>
>> Anyway, there's still a long path before cond_resched() can be removed. It
>> was a mistake by Ankur to add those removals this early (and he has
>> acknowledged that mistake).
>
> OK, that I can live with.  But that seems to be a bit different of a
> take than that of some earlier emails in this thread.  ;-)

Heh I think it's just that this thread goes to (far) too many places :).

As Steven says, the initial series touching everything all together
was a mistake. V1 adds the new preemption model alongside the existing
ones and locally defines cond_resched() as a nop.

That'll allow us to experiment and figure out where there are latency
gaps.

Ankur
Steven Rostedt Dec. 5, 2023, 8:45 p.m. UTC | #16
On Tue, 5 Dec 2023 11:38:42 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > 
> > Note that the new preemption model is a new paradigm and we need to start
> > thinking a bit differently if we go to it.  
> 
> We can of course think differently, but existing hardware and software
> will probably be a bit more stubborn.

Not at all. I don't see how hardware plays a role here, but how software is
designed does sometimes require thinking differently.

> 
> > One thing I would like to look into with the new work is to have holding a
> > mutex ignore the NEED_RESCHED_LAZY (similar to what is done with spinlock
> > converted to mutex in the RT kernel). That way you are less likely to be
> > preempted while holding a mutex.  
> 
> I like the concept, but those with mutex_lock() of rarely-held mutexes
> in their fastpaths might have workloads that have a contrary opinion.

I don't understand your above statement. Maybe I wasn't clear with my
statement? The above is more about PREEMPT_FULL, as it currently will
preempt immediately. My above comment is that we can have an option for
PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
it would at least hold off until there's no mutex held. Who cares if it's a
fast path when a task needs to give up the CPU for another task? What I
worry about is scheduling out while holding a mutex which increases the
chance of that mutex being contended upon. Which does have drastic impact
on performance.

> 
> > > Another is the aforementioned situations where removing the cond_resched()
> > > increases latency.  Yes, capping the preemption latency is a wonderful
> > > thing, and the people I chatted with are all for that, but it is only
> > > natural that there would be a corresponding level of concern about the
> > > cases where removing the cond_resched() calls increases latency.  
> > 
> > With the "capped preemption" I'm not sure that would still be the case.
> > cond_resched() currently only preempts if NEED_RESCHED is set. That means
> > the system had to already be in a situation that a schedule needs to
> > happen. There's lots of places in the kernel that run for over a tick
> > without any cond_resched(). The cond_resched() is usually added for
> > locations that show tremendous latency (where either a watchdog triggered,
> > or showed up in some analysis that had a latency that was much greater than
> > a tick).  
> 
> For non-real-time workloads, the average case is important, not just the
> worst case.  In the new lazily preemptible mode of thought, a preemption
> by a non-real-time task will wait a tick.  Earlier, it would have waited
> for the next cond_resched().  Which, in the average case, might have
> arrived much sooner than one tick.

Or much later. It's random. And what's nice about this model is that we can add
more models than just "NONE", "VOLUNTARY", "FULL". We could have a way to
say "this task needs to preempt immediately" and not just for RT tasks.

This allows the user to decide which task preempts more and which does not
(defined by the scheduler), instead of some random cond_resched() that can
also preempt a higher priority task that just finished its quota to run a
low priority task causing latency for the higher priority task.

This is what I mean by "think differently".

> 
> > The point is, if/when we switch to the new preemption model, we would need
> > to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> > done to prevent regressions. But the reasons I see cond_resched() being
> > added today, should no longer exist with this new model.  
> 
> This I agree with.  Also, with the new paradigm and new mode of thought
> in place, it should be safe to drop any cond_resched() that is in a loop
> that consumes more than a tick of CPU time per iteration.

Why does that matter? Is the loop not important? Why stop it from finishing
for some random task that may not be important, and cond_resched() has no
idea if it is or not.

> 
> > > There might be others as well.  These are the possibilities that have
> > > come up thus far.
> > >   
> > > > They all suck and keeping some of them is just counterproductive as
> > > > again people will sprinkle them all over the place for the very wrong
> > > > reasons.    
> > > 
> > > Yes, but do they suck enough and are they counterproductive enough to
> > > be useful and necessary?  ;-)  
> > 
> > They are only useful and necessary because of the way we handle preemption
> > today. With the new preemption model, they are all likely to be useless and
> > unnecessary ;-)  
> 
> The "all likely" needs some demonstration.  I agree that a great many
> of them would be useless and unnecessary.  Maybe even the vast majority.
> But that is different than "all".  ;-)

I'm betting it is "all" ;-) But I also agree that this "needs some
demonstration". We are not there yet, and likely will not be until the
second half of next year. So we have plenty of time to speak rhetorically
to each other!



> > The conflict is with the new paradigm (I love that word! It's so "buzzy").
> > As I mentioned above, cond_resched() is usually added when a problem was
> > seen. I really believe that those problems would never have been seen if
> > the new paradigm had already been in place.  
> 
> Indeed, that sort of wording does quite the opposite of raising my
> confidence levels.  ;-)

Yes, I admit the "manager speak" isn't something to brag about here. But I
really do like that word. It's just fun to say (and spell)! Paradigm,
paradigm, paradigm! It's that silent 'g'. Although, I wonder if we should
be like gnu, and pronounce it when speaking about free software? Although,
that makes the word sound worse. :-p

> 
> You know, the ancient Romans would have had no problem dealing with the
> dot-com boom, cryptocurrency, some of the shadier areas of artificial
> intelligence and machine learning, and who knows what all else.  As the
> Romans used to say, "Beware of geeks bearing grifts."
> 
> > > >   3) Looking at the initial problem Ankur was trying to solve there is
> > > >      absolutely no acceptable solution to solve that unless you think
> > > >      that the semantically invers 'allow_preempt()/disallow_preempt()'
> > > >      is anywhere near acceptable.    
> > > 
> > > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > > argument, you need to find someone else to argue with.  ;-)  
> > 
> > Anyway, there's still a long path before cond_resched() can be removed. It
> > was a mistake by Ankur to add those removals this early (and he has
> > acknowledged that mistake).  
> 
> OK, that I can live with.  But that seems to be a bit different of a
> take than that of some earlier emails in this thread.  ;-)

Well, we are also stating the final goal as well. I think there's some
confusion about what's going to happen immediately and what's going to happen
in the long run.

> 
> > First we need to get the new preemption model implemented. When it is, it
> > can be just a config option at first. Then when that config option is set,
> > you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> > between them at run time as they are just a way to tell the scheduler when
> > to set NEED_RESCHED_LAZY vs NEED_RESCHED.
> 
> Assuming CONFIG_PREEMPT_RCU=y, agreed.  With CONFIG_PREEMPT_RCU=n,
> the runtime switching needs to be limited to NONE and VOLUNTARY.
> Which is fine.

But why? Because the run time switches of NONE and VOLUNTARY are no
different than FULL.

Why do I say that? Because:

For all modes, once NEED_RESCHED_LAZY is set, the kernel has one tick to get
out or NEED_RESCHED will be set (of course, that one tick may be
configurable). Once NEED_RESCHED is set, the kernel is converted to
PREEMPT_FULL.

Even if the user sets the mode to "NONE", after the above scenario (one tick
after NEED_RESCHED_LAZY is set) the kernel will be behaving no differently
than PREEMPT_FULL.

So why make the distinction for CONFIG_PREEMPT_RCU=n and limit it to only
NONE and VOLUNTARY? It must work with FULL, or it will be broken for NONE
and VOLUNTARY one tick after NEED_RESCHED_LAZY is set.
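The one-tick escalation described above can be sketched in plain userspace C.
All names here (set_need_resched_lazy(), scheduler_tick(), the flag bits) are
illustrative stand-ins for the behavior being discussed, not the kernel's
actual implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task flags mirroring the description above. */
#define TIF_NEED_RESCHED      (1u << 0)
#define TIF_NEED_RESCHED_LAZY (1u << 1)

struct task {
	unsigned int flags;
	int ticks_since_lazy;	/* ticks elapsed since LAZY was set */
};

/*
 * Ask the task to reschedule lazily: it may keep running until it
 * returns to user space, but only for a bounded number of ticks.
 */
static void set_need_resched_lazy(struct task *t)
{
	t->flags |= TIF_NEED_RESCHED_LAZY;
	t->ticks_since_lazy = 0;
}

/*
 * Called once per scheduler tick.  After one tick with the lazy bit
 * still pending, escalate to a full NEED_RESCHED - from that point
 * the kernel behaves like PREEMPT_FULL regardless of the chosen model.
 */
static void scheduler_tick(struct task *t)
{
	if (t->flags & TIF_NEED_RESCHED_LAZY) {
		if (++t->ticks_since_lazy >= 1)
			t->flags |= TIF_NEED_RESCHED;
	}
}

static bool must_preempt_now(const struct task *t)
{
	return t->flags & TIF_NEED_RESCHED;
}
```

The point of the sketch: whatever the mode (NONE, VOLUNTARY, FULL), once the
escalation fires the behavior is indistinguishable from PREEMPT_FULL.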

> 
> > At that moment, when that config is set, the cond_resched() can turn into a
> > nop. This will allow for testing to make sure there are no regressions in
> > latency, even with the NONE mode enabled.  
> 
> And once it appears to be reasonably stable (in concept as well as
> implementation), heavy testing should get underway.

Agreed.

> 
> > The real test is implementing the code and seeing how it affects things in
> > the real world. Us arguing about it isn't going to get anywhere.  
> 
> Indeed, the opinion of the objective universe always wins.  It all too
> often takes longer than necessary for the people arguing with each other
> to realize this, but such is life.
> 
> >                                                                  I just
> > don't want blind NACK. A NACK to a removal of a cond_resched() needs to
> > show that there was a real regression with that removal.  
> 
> Fair enough, although a single commit bulk removing a large number of
> cond_resched() calls will likely get a bulk NAK.

We'll see. I now have a goal to hit!

-- Steve
Paul E. McKenney Dec. 6, 2023, 4:07 a.m. UTC | #17
On Tue, Dec 05, 2023 at 12:18:26PM -0800, Ankur Arora wrote:
> 
> Paul E. McKenney <paulmck@kernel.org> writes:
> 
> > On Tue, Dec 05, 2023 at 10:01:14AM -0500, Steven Rostedt wrote:
> >> On Mon, 4 Dec 2023 17:01:21 -0800
> >> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >>
> >> > On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
> >> > > Paul!
> >> > >
> >> > > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
> >> > > > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
> ...
> >> > >   3) Looking at the initial problem Ankur was trying to solve there is
> >> > >      absolutely no acceptable solution to solve that unless you think
> >> > >      that the semantically invers 'allow_preempt()/disallow_preempt()'
> >> > >      is anywhere near acceptable.
> >> >
> >> > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> >> > argument, you need to find someone else to argue with.  ;-)
> >>
> >> Anyway, there's still a long path before cond_resched() can be removed. It
> >> was a mistake by Ankur to add those removals this early (and he has
> >> acknowledged that mistake).
> >
> > OK, that I can live with.  But that seems to be a bit different of a
> > take than that of some earlier emails in this thread.  ;-)
> 
> Heh I think it's just that this thread goes to (far) too many places :).
> 
> As Steven says, the initial series touching everything all together
> was a mistake. V1 adds the new preemption model alongside the existing
> ones locally defines cond_resched() as nop.
> 
> That'll allow us to experiment and figure out where there are latency
> gaps.

Sounds very good!

Again, I am very supportive of the overall direction.  Devils and details
and all that.  ;-)

							Thanx, Paul
David Laight Dec. 6, 2023, 10:08 a.m. UTC | #18
...
> > > One thing I would like to look into with the new work is to have holding a
> > > mutex ignore the NEED_RESCHED_LAZY (similar to what is done with spinlock
> > > converted to mutex in the RT kernel). That way you are less likely to be
> > > preempted while holding a mutex.
> >
> > I like the concept, but those with mutex_lock() of rarely-held mutexes
> > in their fastpaths might have workloads that have a contrary opinion.
> 
> I don't understand your above statement. Maybe I wasn't clear with my
> statement? The above is more about PREEMPT_FULL, as it currently will
> preempt immediately. My above comment is that we can have an option for
> PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
> it would at least hold off until there's no mutex held. Who cares if it's a
> fast path when a task needs to give up the CPU for another task?
>
> What I
> worry about is scheduling out while holding a mutex which increases the
> chance of that mutex being contended upon. Which does have drastic impact
> on performance.

Indeed.
You really don't want to preempt with a mutex held if it is about to be
released. Unfortunately this really requires a CBU (Crystal Ball Unit).

But I don't think the scheduler timer ticks can be anywhere near frequent
enough to do the 'slightly delayed preemption' that is being talked
about - not without a massive overhead.

I can think of two typical uses of a mutex.
One is a short code path that doesn't usually sleep, but calls kmalloc()
so might do so. That could be quite 'hot' and you wouldn't really want
extra 'random' preemption.
The other is long paths that are going to wait for IO, any concurrent
call is expected to sleep. Extra preemption here probably won't matter.

So maybe the software should give the scheduler a hint.
Perhaps as a property of the mutex or the acquire request?
Or feeding the time the mutex has been held into the mix.
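A toy illustration of such a hint, purely to make the suggestion concrete
(hold_hint, hinted_mutex and defer_preemption() are invented for this sketch;
no such kernel API exists):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical hold-time hint a mutex could carry, as suggested above:
 * either a short, hot path that rarely sleeps, or a long IO-bound path
 * where concurrent waiters are expected to sleep anyway.
 */
enum hold_hint { HOLD_SHORT, HOLD_LONG };

struct hinted_mutex {
	enum hold_hint hint;
	bool held;
};

/*
 * Lazy-preemption policy sketch: defer preemption while a mutex that
 * is expected to be released soon is held, but not for long-hold
 * mutexes, where deferring buys nothing.
 */
static bool defer_preemption(const struct hinted_mutex *m)
{
	return m->held && m->hint == HOLD_SHORT;
}
```

Whether the hint should be a property of the mutex, of the acquire call, or
derived from the measured hold time is exactly the open question above.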

It is all a bit like the difference between using spinlock() and
spinlock_irqsave() for a lock that will never be held by an ISR.
If the lock is ever likely to be contended you (IMHO) really want
to use the 'irqsave' version in order to stop the waiting thread
spinning for the duration of the ISR (and following softint).
There is (probably) little point worrying about IRQ latency, a
single readl() to our fpga based PCIe slave will spin a 3GHz cpu
for around 3000 clocks - that is a lot of normal code.
I've had terrible problems avoiding the extra pthread_mutex() hold
times caused by ISRs and softints; they must have the same effect on
spin_lock().

	David

Ankur Arora Dec. 7, 2023, 1:33 a.m. UTC | #19
Paul E. McKenney <paulmck@kernel.org> writes:

> On Tue, Dec 05, 2023 at 12:18:26PM -0800, Ankur Arora wrote:
>>
>> Paul E. McKenney <paulmck@kernel.org> writes:
>>
>> > On Tue, Dec 05, 2023 at 10:01:14AM -0500, Steven Rostedt wrote:
>> >> On Mon, 4 Dec 2023 17:01:21 -0800
>> >> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>> >>
>> >> > On Tue, Nov 28, 2023 at 11:53:19AM +0100, Thomas Gleixner wrote:
>> >> > > Paul!
>> >> > >
>> >> > > On Tue, Nov 21 2023 at 07:19, Paul E. McKenney wrote:
>> >> > > > On Tue, Nov 21, 2023 at 10:00:59AM -0500, Steven Rostedt wrote:
>> ...
>> >> > >   3) Looking at the initial problem Ankur was trying to solve there is
>> >> > >      absolutely no acceptable solution to solve that unless you think
>> >> > >      that the semantically invers 'allow_preempt()/disallow_preempt()'
>> >> > >      is anywhere near acceptable.
>> >> >
>> >> > I am not arguing for allow_preempt()/disallow_preempt(), so for that
>> >> > argument, you need to find someone else to argue with.  ;-)
>> >>
>> >> Anyway, there's still a long path before cond_resched() can be removed. It
>> >> was a mistake by Ankur to add those removals this early (and he has
>> >> acknowledged that mistake).
>> >
>> > OK, that I can live with.  But that seems to be a bit different of a
>> > take than that of some earlier emails in this thread.  ;-)
>>
>> Heh I think it's just that this thread goes to (far) too many places :).
>>
>> As Steven says, the initial series touching everything all together
>> was a mistake. V1 adds the new preemption model alongside the existing
>> ones and locally defines cond_resched() as a nop.
>>
>> That'll allow us to experiment and figure out where there are latency
>> gaps.
>
> Sounds very good!
>
> Again, I am very supportive of the overall direction.  Devils and details
> and all that.  ;-)

Agreed. And thanks!

--
ankur
Paul E. McKenney Dec. 7, 2023, 4:34 a.m. UTC | #20
On Tue, Dec 05, 2023 at 03:45:18PM -0500, Steven Rostedt wrote:
> On Tue, 5 Dec 2023 11:38:42 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > > 
> > > Note that the new preemption model is a new paradigm and we need to start
> > > thinking a bit differently if we go to it.  
> > 
> > We can of course think differently, but existing hardware and software
> > will probably be a bit more stubborn.
> 
> Not at all. I don't see how hardware plays a role here, but how software is
> designed does sometimes require thinking differently.

The hardware runs the software and so gets its say.  And I of course do
agree that changes in software sometimes require thinking differently,
but I can also personally attest to how much work it is and how long it
takes to induce changes in thinking.  ;-)

> > > One thing I would like to look into with the new work is to have holding a
> > > mutex ignore the NEED_RESCHED_LAZY (similar to what is done with spinlock
> > > converted to mutex in the RT kernel). That way you are less likely to be
> > > preempted while holding a mutex.  
> > 
> > I like the concept, but those with mutex_lock() of rarely-held mutexes
> > in their fastpaths might have workloads that have a contrary opinion.
> 
> I don't understand your above statement. Maybe I wasn't clear with my
> statement? The above is more about PREEMPT_FULL, as it currently will
> preempt immediately. My above comment is that we can have an option for
> PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
> it would at least hold off until there's no mutex held. Who cares if it's a
> fast path when a task needs to give up the CPU for another task? What I
> worry about is scheduling out while holding a mutex which increases the
> chance of that mutex being contended upon. Which does have drastic impact
> on performance.

As I understand the current mutex_lock() code, the fastpaths leave no
scheduler-visible clue that a mutex is in fact held.  If there is no
such clue, it is quite likely that those fastpaths will need to do some
additional clue-leaving work, increasing their overhead.  And while it
is always possible that this overhead will be down in the noise, if it
was too far down in the noise there would be no need for those fastpaths.

So it is possible (but by no means certain) that some workloads will end
up caring.
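As a toy illustration of what "leaving a clue" could look like, and of the
per-lock overhead it implies (everything here is hypothetical, not the
kernel's mutex code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical per-task state.  The real mutex fastpath leaves no such
 * scheduler-visible trace; adding one means extra work on every
 * lock/unlock, which is the overhead concern above.
 */
struct task_info {
	int mutexes_held;	/* the "clue" the scheduler could consult */
};

static void mutex_fastpath_lock(struct task_info *t)
{
	/* ... the actual cmpxchg of the owner field would go here ... */
	t->mutexes_held++;	/* the added clue-leaving work */
}

static void mutex_fastpath_unlock(struct task_info *t)
{
	t->mutexes_held--;
	/* ... release of the owner field would go here ... */
}

/* A lazy-preemption policy could then hold off while any mutex is held. */
static bool may_defer_preemption(const struct task_info *t)
{
	return t->mutexes_held > 0;
}
```

Whether the counter updates stay "down in the noise" on fastpath-heavy
workloads is precisely what would need measuring.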

> > > > Another is the aforementioned situations where removing the cond_resched()
> > > > increases latency.  Yes, capping the preemption latency is a wonderful
> > > > thing, and the people I chatted with are all for that, but it is only
> > > > natural that there would be a corresponding level of concern about the
> > > > cases where removing the cond_resched() calls increases latency.  
> > > 
> > > With the "capped preemption" I'm not sure that would still be the case.
> > > cond_resched() currently only preempts if NEED_RESCHED is set. That means
> > > the system had to already be in a situation that a schedule needs to
> > > happen. There's lots of places in the kernel that run for over a tick
> > > without any cond_resched(). The cond_resched() is usually added for
> > > locations that show tremendous latency (where either a watchdog triggered,
> > > or showed up in some analysis that had a latency that was much greater than
> > > a tick).  
> > 
> > For non-real-time workloads, the average case is important, not just the
> > worst case.  In the new lazily preemptible mode of thought, a preemption
> > by a non-real-time task will wait a tick.  Earlier, it would have waited
> > for the next cond_resched().  Which, in the average case, might have
> > arrived much sooner than one tick.
> 
> Or much later. It's random. And what's nice about this model, we can add
> more models than just "NONE", "VOLUNTARY", "FULL". We could have a way to
> say "this task needs to preempt immediately" and not just for RT tasks.
> 
> This allows the user to decide which task preempts more and which does not
> (defined by the scheduler), instead of some random cond_resched() that can
> also preempt a higher priority task that just finished its quota to run a
> low priority task causing latency for the higher priority task.
> 
> This is what I mean by "think differently".

I did understand your meaning, and it is a source of some concern.  ;-)

When things become sufficiently stable, larger-scale tests will of course
be needed, not just different thought.

> > > The point is, if/when we switch to the new preemption model, we would need
> > > to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> > > done to prevent regressions. But the reasons I see cond_resched() being
> > > added today, should no longer exist with this new model.  
> > 
> > This I agree with.  Also, with the new paradigm and new mode of thought
> > in place, it should be safe to drop any cond_resched() that is in a loop
> > that consumes more than a tick of CPU time per iteration.
> 
> Why does that matter? Is the loop not important? Why stop it from finishing
> for some random task that may not be important, and cond_resched() has no
> idea if it is or not.

Because if it takes more than a tick to reach the next cond_resched(),
lazy preemption is likely to preempt before that cond_resched() is
reached.  Which suggests that such a cond_resched() would not be all
that valuable in the new thought paradigm.  Give or take potential issues
with exactly where the preemption happens.

> > > > There might be others as well.  These are the possibilities that have
> > > > come up thus far.
> > > >   
> > > > > They all suck and keeping some of them is just counterproductive as
> > > > > again people will sprinkle them all over the place for the very wrong
> > > > > reasons.    
> > > > 
> > > > Yes, but do they suck enough and are they counterproductive enough to
> > > > be useful and necessary?  ;-)  
> > > 
> > > They are only useful and necessary because of the way we handle preemption
> > > today. With the new preemption model, they are all likely to be useless and
> > > unnecessary ;-)  
> > 
> > The "all likely" needs some demonstration.  I agree that a great many
> > of them would be useless and unnecessary.  Maybe even the vast majority.
> > But that is different than "all".  ;-)
> 
> I'm betting it is "all" ;-) But I also agree that this "needs some
> demonstration". We are not there yet, and likely will not be until the
> second half of next year. So we have plenty of time to speak rhetorically
> to each other!

You know, we usually find time to engage in rhetorical conversation.  ;-)

> > > The conflict is with the new paradigm (I love that word! It's so "buzzy").
> > > As I mentioned above, cond_resched() is usually added when a problem was
> > > seen. I really believe that those problems would never had been seen if
> > > the new paradigm had already been in place.  
> > 
> > Indeed, that sort of wording does quite the opposite of raising my
> > confidence levels.  ;-)
> 
> Yes, I admit the "manager speak" isn't something to brag about here. But I
> really do like that word. It's just fun to say (and spell)! Paradigm,
> paradigm, paradigm! It's that silent 'g'. Although, I wonder if we should
> be like gnu, and pronounce it when speaking about free software? Although,
> that makes the word sound worse. :-p

Pair a' dime, pair a' quarter, pair a' fifty-cent pieces, whatever it takes!

> > You know, the ancient Romans would have had no problem dealing with the
> > dot-com boom, cryptocurrency, some of the shadier areas of artificial
> > intelligence and machine learning, and who knows what all else.  As the
> > Romans used to say, "Beware of geeks bearing grifts."
> > 
> > > > >   3) Looking at the initial problem Ankur was trying to solve there is
> > > > >      absolutely no acceptable solution to solve that unless you think
> > > > >      that the semantically invers 'allow_preempt()/disallow_preempt()'
> > > > >      is anywhere near acceptable.    
> > > > 
> > > > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > > > argument, you need to find someone else to argue with.  ;-)  
> > > 
> > > Anyway, there's still a long path before cond_resched() can be removed. It
> > > was a mistake by Ankur to add those removals this early (and he has
> > > acknowledged that mistake).  
> > 
> > OK, that I can live with.  But that seems to be a bit different of a
> > take than that of some earlier emails in this thread.  ;-)
> 
> Well, we are also stating the final goal as well. I think there's some
> confusion to what's going to happen immediately and what's going to happen
> in the long run.

If I didn't know better, I might suspect that in addition to the
confusion, there are a few differences of opinion.  ;-)

> > > First we need to get the new preemption model implemented. When it is, it
> > > can be just a config option at first. Then when that config option is set,
> > > you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> > > between them at run time as they are just a way to tell the scheduler when
> > > to set NEED_RESCHED_LAZY vs NEED_RESCHED.
> > 
> > Assuming CONFIG_PREEMPT_RCU=y, agreed.  With CONFIG_PREEMPT_RCU=n,
> > the runtime switching needs to be limited to NONE and VOLUNTARY.
> > Which is fine.
> 
> But why? Because the run time switches of NONE and VOLUNTARY are no
> different than FULL.
> 
> Why I say that? Because:
> 
> For all modes, NEED_RESCHED_LAZY is set, the kernel has one tick to get out
> or NEED_RESCHED will be set (of course that one tick may be configurable).
> Once the NEED_RESCHED is set, then the kernel is converted to PREEMPT_FULL.
> 
> Even if the user sets the mode to "NONE", after the above scenario (one tick
> after NEED_RESCHED_LAZY is set) the kernel will be behaving no differently
> than PREEMPT_FULL.
> 
> So why make the difference between CONFIG_PREEMPT_RCU=n and limit to only
> NONE and VOLUNTARY. It must work with FULL or it will be broken for NONE
> and VOLUNTARY after one tick from NEED_RESCHED_LAZY being set.

Because PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
combination.  All of the gains from PREEMPT_FULL=y are more than lost
due to PREEMPT_RCU=n, especially when the kernel decides to do something
like walk a long task list under RCU protection.  We should not waste
people's time getting burned by this combination, nor should we waste
cycles testing it.

> > > At that moment, when that config is set, the cond_resched() can turn into a
> > > nop. This will allow for testing to make sure there are no regressions in
> > > latency, even with the NONE mode enabled.  
> > 
> > And once it appears to be reasonably stable (in concept as well as
> > implementation), heavy testing should get underway.
> 
> Agreed.
> 
> > 
> > > The real test is implementing the code and seeing how it affects things in
> > > the real world. Us arguing about it isn't going to get anywhere.  
> > 
> > Indeed, the opinion of the objective universe always wins.  It all too
> > often takes longer than necessary for the people arguing with each other
> > to realize this, but such is life.
> > 
> > >                                                                  I just
> > > don't want blind NACK. A NACK to a removal of a cond_resched() needs to
> > > show that there was a real regression with that removal.  
> > 
> > Fair enough, although a single commit bulk removing a large number of
> > cond_resched() calls will likely get a bulk NAK.
> 
> We'll see. I now have a goal to hit!

;-) ;-) ;-)

							Thanx, Paul
Steven Rostedt Dec. 7, 2023, 1:44 p.m. UTC | #21
On Wed, 6 Dec 2023 20:34:11 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > > I like the concept, but those with mutex_lock() of rarely-held mutexes
> > > in their fastpaths might have workloads that have a contrary opinion.  
> > 
> > I don't understand your above statement. Maybe I wasn't clear with my
> > statement? The above is more about PREEMPT_FULL, as it currently will
> > preempt immediately. My above comment is that we can have an option for
> > PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
> > it would at least hold off until there's no mutex held. Who cares if it's a
> > fast path when a task needs to give up the CPU for another task? What I
> > worry about is scheduling out while holding a mutex which increases the
> > chance of that mutex being contended upon. Which does have drastic impact
> > on performance.  
> 
> As I understand the current mutex_lock() code, the fastpaths leave no
> scheduler-visible clue that a mutex is in fact held.  If there is no
> such clue, it is quite likely that those fastpaths will need to do some
> additional clue-leaving work, increasing their overhead.  And while it
> is always possible that this overhead will be down in the noise, if it
> was too far down in the noise there would be no need for those fastpaths.
> 
> So it is possible (but by no means certain) that some workloads will end
> up caring.

OK, that makes more sense, and I do agree with that statement. It would
need to do something like spin locks do with preempt disable, but I agree,
this would need to be done in a way not to cause performance regressions.


> 
> > > > > Another is the aforementioned situations where removing the cond_resched()
> > > > > increases latency.  Yes, capping the preemption latency is a wonderful
> > > > > thing, and the people I chatted with are all for that, but it is only
> > > > > natural that there would be a corresponding level of concern about the
> > > > > cases where removing the cond_resched() calls increases latency.    
> > > > 
> > > > With the "capped preemption" I'm not sure that would still be the case.
> > > > cond_resched() currently only preempts if NEED_RESCHED is set. That means
> > > > the system had to already be in a situation that a schedule needs to
> > > > happen. There's lots of places in the kernel that run for over a tick
> > > > without any cond_resched(). The cond_resched() is usually added for
> > > > locations that show tremendous latency (where either a watchdog triggered,
> > > > or showed up in some analysis that had a latency that was much greater than
> > > > a tick).    
> > > 
> > > For non-real-time workloads, the average case is important, not just the
> > > worst case.  In the new lazily preemptible mode of thought, a preemption
> > > by a non-real-time task will wait a tick.  Earlier, it would have waited
> > > for the next cond_resched().  Which, in the average case, might have
> > > arrived much sooner than one tick.  
> > 
> > Or much later. It's random. And what's nice about this model, we can add
> > more models than just "NONE", "VOLUNTARY", "FULL". We could have a way to
> > say "this task needs to preempt immediately" and not just for RT tasks.
> > 
> > This allows the user to decide which task preempts more and which does not
> > (defined by the scheduler), instead of some random cond_resched() that can
> > also preempt a higher priority task that just finished its quota to run a
> > low priority task causing latency for the higher priority task.
> > 
> > This is what I mean by "think differently".  
> 
> I did understand your meaning, and it is a source of some concern.  ;-)
> 
> When things become sufficiently stable, larger-scale tests will of course
> be needed, not just different thought..

Fair enough.

> 
> > > > The point is, if/when we switch to the new preemption model, we would need
> > > > to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> > > > done to prevent regressions. But the reasons I see cond_resched() being
> > > > added today, should no longer exist with this new model.    
> > > 
> > > This I agree with.  Also, with the new paradigm and new mode of thought
> > > in place, it should be safe to drop any cond_resched() that is in a loop
> > > that consumes more than a tick of CPU time per iteration.  
> > 
> > Why does that matter? Is the loop not important? Why stop it from finishing
> > for some random task that may not be important, and cond_resched() has no
> > idea if it is or not.  
> 
> Because if it takes more than a tick to reach the next cond_resched(),
> lazy preemption is likely to preempt before that cond_resched() is
> reached.  Which suggests that such a cond_resched() would not be all
> that valuable in the new thought paradigm.  Give or take potential issues
> with exactly where the preemption happens.

I'm just saying that there are lots of places where the above happens, which
is why we are still scattering cond_resched() all over the place.

> 
> > > > > There might be others as well.  These are the possibilities that have
> > > > > come up thus far.
> > > > >     
> > > > > > They all suck and keeping some of them is just counterproductive as
> > > > > > again people will sprinkle them all over the place for the very wrong
> > > > > > reasons.      
> > > > > 
> > > > > Yes, but do they suck enough and are they counterproductive enough to
> > > > > be useful and necessary?  ;-)    
> > > > 
> > > > They are only useful and necessary because of the way we handle preemption
> > > > today. With the new preemption model, they are all likely to be useless and
> > > > unnecessary ;-)    
> > > 
> > > The "all likely" needs some demonstration.  I agree that a great many
> > > of them would be useless and unnecessary.  Maybe even the vast majority.
> > > But that is different than "all".  ;-)  
> > 
> > I'm betting it is "all" ;-) But I also agree that this "needs some
> > demonstration". We are not there yet, and likely will not be until the
> > second half of next year. So we have plenty of time to speak rhetorically
> > to each other!  
> 
> You know, we usually find time to engage in rhetorical conversation.  ;-)
> 
> > > > The conflict is with the new paradigm (I love that word! It's so "buzzy").
> > > > As I mentioned above, cond_resched() is usually added when a problem was
> > > > seen. I really believe that those problems would never had been seen if
> > > > the new paradigm had already been in place.    
> > > 
> > > Indeed, that sort of wording does quite the opposite of raising my
> > > confidence levels.  ;-)  
> > 
> > Yes, I admit the "manager speak" isn't something to brag about here. But I
> > really do like that word. It's just fun to say (and spell)! Paradigm,
> > paradigm, paradigm! It's that silent 'g'. Although, I wonder if we should
> > be like gnu, and pronounce it when speaking about free software? Although,
> > that makes the word sound worse. :-p  
> 
> Pair a' dime, pair a' quarter, pair a' fifty-cent pieces, whatever it takes!

 Pair a' two-bits : that's all it's worth

Or

 Pair a' two-cents : as it's my two cents that I'm giving.


> 
> > > You know, the ancient Romans would have had no problem dealing with the
> > > dot-com boom, cryptocurrency, some of the shadier areas of artificial
> > > intelligence and machine learning, and who knows what all else.  As the
> > > Romans used to say, "Beware of geeks bearing grifts."
> > >   
> > > > > >   3) Looking at the initial problem Ankur was trying to solve there is
> > > > > >      absolutely no acceptable solution to solve that unless you think
> > > > > >      that the semantically invers 'allow_preempt()/disallow_preempt()'
> > > > > >      is anywhere near acceptable.      
> > > > > 
> > > > > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > > > > argument, you need to find someone else to argue with.  ;-)    
> > > > 
> > > > Anyway, there's still a long path before cond_resched() can be removed. It
> > > > was a mistake by Ankur to add those removals this early (and he has
> > > > acknowledged that mistake).    
> > > 
> > > OK, that I can live with.  But that seems to be a bit different of a
> > > take than that of some earlier emails in this thread.  ;-)  
> > 
> > Well, we are also stating the final goal as well. I think there's some
> > confusion to what's going to happen immediately and what's going to happen
> > in the long run.  
> 
> If I didn't know better, I might suspect that in addition to the
> confusion, there are a few differences of opinion.  ;-)

Confusion enhances differences of opinion.

> 
> > > > First we need to get the new preemption model implemented. When it is, it
> > > > can be just a config option at first. Then when that config option is set,
> > > > you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> > > > between them at run time as they are just a way to tell the scheduler when
> > > > to set NEED_RESCHED_LAZY vs NEED_RESCHED.
> > > 
> > > Assuming CONFIG_PREEMPT_RCU=y, agreed.  With CONFIG_PREEMPT_RCU=n,
> > > the runtime switching needs to be limited to NONE and VOLUNTARY.
> > > Which is fine.  
> > 
> > But why? Because the run time switches of NONE and VOLUNTARY are no
> > different than FULL.
> > 
> > Why I say that? Because:
> > 
> > For all modes, NEED_RESCHED_LAZY is set, the kernel has one tick to get out
> > or NEED_RESCHED will be set (of course that one tick may be configurable).
> > Once the NEED_RESCHED is set, then the kernel is converted to PREEMPT_FULL.
> > 
> > Even if the user sets the mode to "NONE", after the above scenario (one tick
> > after NEED_RESCHED_LAZY is set) the kernel will be behaving no differently
> > than PREEMPT_FULL.
> > 
> > So why make the difference between CONFIG_PREEMPT_RCU=n and limit to only
> > NONE and VOLUNTARY. It must work with FULL or it will be broken for NONE
> > and VOLUNTARY after one tick from NEED_RESCHED_LAZY being set.  
> 
> Because PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
> combination.  All of the gains from PREEMPT_FULL=y are more than lost
> due to PREEMPT_RCU=n, especially when the kernel decides to do something
> like walk a long task list under RCU protection.  We should not waste
> people's time getting burned by this combination, nor should we waste
> cycles testing it.

The issue I see here is that PREEMPT_RCU is not something that we can
convert at run time, whereas NONE, VOLUNTARY, FULL (and more to come) can
be. And you have stated that PREEMPT_RCU adds some more overhead that
people may not care for. But even though you say PREEMPT_RCU=n makes no
sense with PREEMPT_FULL, that doesn't mean we should not allow it. Especially
as we have to make sure that it still works (even NONE and VOLUNTARY turn
into FULL after that one tick).

Remember, what we are looking at is having:

N : NEED_RESCHED - schedule at next possible location
L : NEED_RESCHED_LAZY - schedule when going into user space.

When to set what for a task needing to schedule?

 Model           SCHED_OTHER         RT/DL(or user specified)
 -----           -----------         ------------------------
 NONE                 L                         L
 VOLUNTARY            L                         N
 FULL                 N                         N

By saying FULL, you are saying that you want SCHED_OTHER as well as
RT/DL tasks to schedule as soon as possible and not wait until going into
user space. This is still applicable even with PREEMPT_RCU=n.

It may be that someone wants better latency for all tasks (like VOLUNTARY)
but not the overhead that PREEMPT_RCU gives, and is OK with the added
latency as a result.
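The table above transcribes directly into code. A userspace sketch, with all
enum names invented for illustration:

```c
#include <assert.h>

enum model  { MODEL_NONE, MODEL_VOLUNTARY, MODEL_FULL };
enum tclass { CLASS_SCHED_OTHER, CLASS_RT_DL };
enum flag   { FLAG_LAZY, FLAG_NOW };	/* L vs N in the table above */

/*
 * Which flag the scheduler sets when a task of the given class needs
 * to schedule, per the Model table: FULL always sets NEED_RESCHED,
 * VOLUNTARY sets it only for RT/DL (or user-specified) tasks, and
 * NONE always goes lazy.
 */
static enum flag resched_flag(enum model m, enum tclass c)
{
	if (m == MODEL_FULL)
		return FLAG_NOW;
	if (m == MODEL_VOLUNTARY && c == CLASS_RT_DL)
		return FLAG_NOW;
	return FLAG_LAZY;
}
```

Nothing in this mapping depends on CONFIG_PREEMPT_RCU, which is the argument
being made: the model selection is orthogonal to the RCU flavor.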

-- Steve
Paul E. McKenney Dec. 8, 2023, 4:28 a.m. UTC | #22
On Thu, Dec 07, 2023 at 08:44:57AM -0500, Steven Rostedt wrote:
> On Wed, 6 Dec 2023 20:34:11 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > > > I like the concept, but those with mutex_lock() of rarely-held mutexes
> > > > in their fastpaths might have workloads that have a contrary opinion.  
> > > 
> > > I don't understand your above statement. Maybe I wasn't clear with my
> > > statement? The above is more about PREEMPT_FULL, as it currently will
> > > preempt immediately. My above comment is that we can have an option for
> > > PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
> > > it would at least hold off until there's no mutex held. Who cares if it's a
> > > fast path when a task needs to give up the CPU for another task? What I
> > > worry about is scheduling out while holding a mutex which increases the
> > > chance of that mutex being contended upon. Which does have drastic impact
> > > on performance.  
> > 
> > As I understand the current mutex_lock() code, the fastpaths leave no
> > scheduler-visible clue that a mutex is in fact held.  If there is no
> > such clue, it is quite likely that those fastpaths will need to do some
> > additional clue-leaving work, increasing their overhead.  And while it
> > is always possible that this overhead will be down in the noise, if it
> > was too far down in the noise there would be no need for those fastpaths.
> > 
> > So it is possible (but by no means certain) that some workloads will end
> > up caring.
> 
> OK, that makes more sense, and I do agree with that statement. It would
> need to do something like spin locks do with preempt disable, but I agree,
> this would need to be done in a way not to cause performance regressions.

Whew!  ;-)

> > > > > > Another is the aforementioned situations where removing the cond_resched()
> > > > > > increases latency.  Yes, capping the preemption latency is a wonderful
> > > > > > thing, and the people I chatted with are all for that, but it is only
> > > > > > natural that there would be a corresponding level of concern about the
> > > > > > cases where removing the cond_resched() calls increases latency.    
> > > > > 
> > > > > With the "capped preemption" I'm not sure that would still be the case.
> > > > > cond_resched() currently only preempts if NEED_RESCHED is set. That means
> > > > > the system had to already be in a situation that a schedule needs to
> > > > > happen. There's lots of places in the kernel that run for over a tick
> > > > > without any cond_resched(). The cond_resched() is usually added for
> > > > > locations that show tremendous latency (where either a watchdog triggered,
> > > > > or showed up in some analysis that had a latency that was much greater than
> > > > > a tick).    
> > > > 
> > > > For non-real-time workloads, the average case is important, not just the
> > > > worst case.  In the new lazily preemptible mode of thought, a preemption
> > > > by a non-real-time task will wait a tick.  Earlier, it would have waited
> > > > for the next cond_resched().  Which, in the average case, might have
> > > > arrived much sooner than one tick.  
> > > 
> > > Or much later. It's random. And what's nice about this model, we can add
> > > more models than just "NONE", "VOLUNTARY", "FULL". We could have a way to
> > > say "this task needs to preempt immediately" and not just for RT tasks.
> > > 
> > > This allows the user to decide which task preempts more and which does not
> > > (defined by the scheduler), instead of some random cond_resched() that can
> > > also preempt a higher priority task that just finished its quota to run a
> > > low priority task causing latency for the higher priority task.
> > > 
> > > This is what I mean by "think differently".  
> > 
> > I did understand your meaning, and it is a source of some concern.  ;-)
> > 
> > When things become sufficiently stable, larger-scale tests will of course
be needed, not just different thought.
> 
> Fair enough.
> 
> > 
> > > > > The point is, if/when we switch to the new preemption model, we would need
> > > > > to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> > > > > done to prevent regressions. But the reasons I see cond_resched() being
> > > > > added today, should no longer exist with this new model.    
> > > > 
> > > > This I agree with.  Also, with the new paradigm and new mode of thought
> > > > in place, it should be safe to drop any cond_resched() that is in a loop
> > > > that consumes more than a tick of CPU time per iteration.  
> > > 
> > > Why does that matter? Is the loop not important? Why stop it from finishing
> > > for some random task that may not be important, and cond_resched() has no
> > > idea if it is or not.  
> > 
> > Because if it takes more than a tick to reach the next cond_resched(),
> > lazy preemption is likely to preempt before that cond_resched() is
> > reached.  Which suggests that such a cond_resched() would not be all
> > that valuable in the new thought paradigm.  Give or take potential issues
> > with exactly where the preemption happens.
> 
> I'm just saying there's lots of places that the above happens, which is why
> we are still scattering cond_resched() all over the place.

And I agree that greatly reducing (if not eliminating) such scattering
is a great benefit of lazy preemption.

> > > > > > There might be others as well.  These are the possibilities that have
> > > > > > come up thus far.
> > > > > >     
> > > > > > > They all suck and keeping some of them is just counterproductive as
> > > > > > > again people will sprinkle them all over the place for the very wrong
> > > > > > > reasons.      
> > > > > > 
> > > > > > Yes, but do they suck enough and are they counterproductive enough to
> > > > > > be useful and necessary?  ;-)    
> > > > > 
> > > > > They are only useful and necessary because of the way we handle preemption
> > > > > today. With the new preemption model, they are all likely to be useless and
> > > > > unnecessary ;-)    
> > > > 
> > > > The "all likely" needs some demonstration.  I agree that a great many
> > > > of them would be useless and unnecessary.  Maybe even the vast majority.
> > > > But that is different than "all".  ;-)  
> > > 
> > > I'm betting it is "all" ;-) But I also agree that this "needs some
> > > demonstration". We are not there yet, and likely will not be until the
> > > second half of next year. So we have plenty of time to speak rhetorically
> > > to each other!  
> > 
> > You know, we usually find time to engage in rhetorical conversation.  ;-)
> > 
> > > > > The conflict is with the new paradigm (I love that word! It's so "buzzy").
> > > > > As I mentioned above, cond_resched() is usually added when a problem was
> > > > > seen. I really believe that those problems would never had been seen if
> > > > > the new paradigm had already been in place.    
> > > > 
> > > > Indeed, that sort of wording does quite the opposite of raising my
> > > > confidence levels.  ;-)  
> > > 
> > > Yes, I admit the "manager speak" isn't something to brag about here. But I
> > > really do like that word. It's just fun to say (and spell)! Paradigm,
> > > paradigm, paradigm! It's that silent 'g'. Although, I wonder if we should
> > > be like gnu, and pronounce it when speaking about free software? Although,
> > > that makes the word sound worse. :-p  
> > 
> > Pair a' dime, pair a' quarter, pair a' fifty-cent pieces, whatever it takes!
> 
>  Pair a' two-bits : that's all it's worth
> 
> Or
> 
>  Pair a' two-cents : as it's my two cents that I'm giving.

I must confess that the occasional transliteration of paradigm to
pair-of-dimes has been a great sanity-preservation device over the
decades.  ;-)

> > > > You know, the ancient Romans would have had no problem dealing with the
> > > > dot-com boom, cryptocurrency, some of the shadier areas of artificial
> > > > intelligence and machine learning, and who knows what all else.  As the
> > > > Romans used to say, "Beware of geeks bearing grifts."
> > > >   
> > > > > > >   3) Looking at the initial problem Ankur was trying to solve there is
> > > > > > >      absolutely no acceptable solution to solve that unless you think
> > > > > > >      that the semantically invers 'allow_preempt()/disallow_preempt()'
> > > > > > >      is anywhere near acceptable.      
> > > > > > 
> > > > > > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > > > > > argument, you need to find someone else to argue with.  ;-)    
> > > > > 
> > > > > Anyway, there's still a long path before cond_resched() can be removed. It
> > > > > was a mistake by Ankur to add those removals this early (and he has
> > > > > acknowledged that mistake).    
> > > > 
> > > > OK, that I can live with.  But that seems to be a bit different of a
> > > > take than that of some earlier emails in this thread.  ;-)  
> > > 
> > > Well, we are also stating the final goal as well. I think there's some
> > > confusion to what's going to happen immediately and what's going to happen
> > > in the long run.  
> > 
> > If I didn't know better, I might suspect that in addition to the
> > confusion, there are a few differences of opinion.  ;-)
> 
> Confusion enhances differences of opinion.

That can happen, but then again confusion can also result in the
mere appearance of agreement.  ;-)

> > > > > First we need to get the new preemption modeled implemented. When it is, it
> > > > > can be just a config option at first. Then when that config option is set,
> > > > > you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> > > > > between them at run time as they are just a way to tell the scheduler when
> > > > > to set NEED_RESCHED_LAZY vs NEED_RESCHED.    
> > > > 
> > > > Assuming CONFIG_PREEMPT_RCU=y, agreed.  With CONFIG_PREEMPT_RCU=n,
> > > > the runtime switching needs to be limited to NONE and VOLUNTARY.
> > > > Which is fine.  
> > > 
> > > But why? Because the run time switches of NONE and VOLUNTARY are no
> > > different than FULL.
> > > 
> > > Why I say that? Because:
> > > 
> > > For all modes, NEED_RESCHED_LAZY is set, the kernel has one tick to get out
> > > or NEED_RESCHED will be set (of course that one tick may be configurable).
> > > Once the NEED_RESCHED is set, then the kernel is converted to PREEMPT_FULL.
> > > 
> > > Even if the user sets the mode to "NONE", after the above scenario (one tick
> > > after NEED_RESCHED_LAZY is set) the kernel will be behaving no differently
> > > than PREEMPT_FULL.
> > > 
> > > So why make the difference between CONFIG_PREEMPT_RCU=n and limit to only
> > > NONE and VOLUNTARY. It must work with FULL or it will be broken for NONE
> > > and VOLUNTARY after one tick from NEED_RESCHED_LAZY being set.  
> > 
> > Because PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
> > combination.  All of the gains from PREEMPT_FULL=y are more than lost
> > due to PREEMPT_RCU=n, especially when the kernel decides to do something
> > like walk a long task list under RCU protection.  We should not waste
> > people's time getting burned by this combination, nor should we waste
> > cycles testing it.
> 
> The issue I see here is that PREEMPT_RCU is not something that we can
> convert at run time, where the NONE, VOLUNTARY, FULL (and more to come) can
> be. And you have stated that PREEMPT_RCU adds some more overhead that
> people may not care about. But even though you say PREEMPT_RCU=n makes no
> sense with PREEMPT_FULL, that doesn't mean we should not allow it,
> especially since we have to make sure that it still works (even NONE and
> VOLUNTARY turn into FULL after that one tick).
> 
> Remember, what we are looking at is having:
> 
> N : NEED_RESCHED - schedule at next possible location
> L : NEED_RESCHED_LAZY - schedule when going into user space.
> 
> When to set what for a task needing to schedule?
> 
>  Model           SCHED_OTHER         RT/DL(or user specified)
>  -----           -----------         ------------------------
>  NONE                 L                         L
>  VOLUNTARY            L                         N
>  FULL                 N                         N
> 
> By saying FULL, you are saying that you want the SCHED_OTHER as well as
> RT/DL tasks to schedule as soon as possible and not wait until going into
> user space. This is still applicable even with PREEMPT_RCU=n.
> 
> It may be that someone wants better latency for all tasks (like VOLUNTARY)
> but not the overhead that PREEMPT_RCU gives, and is OK with the added
> latency as a result.

Given the additional testing burden and given the likelihood that it won't
do what people want, let's find someone who really needs it (as opposed
to someone who merely wants it) before allowing it to be selected.
It is after all an easy check far from any fastpath to prevent the
combination of PREEMPT_RCU and PREEMPT_FULL.
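Paul's "easy check far from any fastpath" could take a form like the
following Kconfig fragment. This is a hypothetical sketch, not part of
this series (the actual patch below only changes `default` lines), and
the RCU_FORCE_NONPREEMPT symbol is invented purely for illustration:

```
# Hypothetical sketch: if opting out of preemptible RCU were made
# user-selectable, a guard like this would refuse the
# PREEMPT(_FULL)=y + PREEMPT_RCU=n combination at configuration
# time, far from any fastpath.
config RCU_FORCE_NONPREEMPT
	bool "Use non-preemptible RCU despite an always-preemptible kernel"
	depends on !PREEMPT && !PREEMPT_RT
	help
	  Select the stronger forward-progress guarantees of
	  PREEMPT_RCU=n.  Only permitted under the NONE and
	  VOLUNTARY preemption models.
```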

							Thanx, Paul

Patch

diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 126f6b418f6a..75aaa6294421 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -104,7 +104,7 @@  extern int rcu_scheduler_active;
 void rcu_end_inkernel_boot(void);
 bool rcu_inkernel_boot_has_ended(void);
 bool rcu_is_watching(void);
-#ifndef CONFIG_PREEMPTION
+#ifndef CONFIG_PREEMPT
 void rcu_all_qs(void);
 #endif
 
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index bdd7eadb33d8..a808cb29ab7c 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -18,7 +18,7 @@  config TREE_RCU
 
 config PREEMPT_RCU
 	bool
-	default y if PREEMPTION
+	default y if PREEMPT || PREEMPT_RT
 	select TREE_RCU
 	help
 	  This option selects the RCU implementation that is
@@ -31,7 +31,7 @@  config PREEMPT_RCU
 
 config TINY_RCU
 	bool
-	default y if !PREEMPTION && !SMP
+	default y if !PREEMPT && !SMP
 	help
 	  This option selects the RCU implementation that is
 	  designed for UP systems from which real-time response