
[v2,7/9] sched: define TIF_ALLOW_RESCHED

Message ID 20230830184958.2333078-8-ankur.a.arora@oracle.com (mailing list archive)
State New
Series x86/clear_huge_page: multi-page clearing

Commit Message

Ankur Arora Aug. 30, 2023, 6:49 p.m. UTC
On preempt_model_none() or preempt_model_voluntary() configurations,
rescheduling of kernel threads happens only when they allow it, and
only at explicit preemption points, via calls to cond_resched() or
similar.

That leaves out contexts where it is not convenient to periodically
call cond_resched() -- for instance when executing a potentially
long-running primitive (such as REP; STOSB).

This means that we either suffer high scheduling latency or avoid
certain constructs.

Define TIF_ALLOW_RESCHED to demarcate such sections.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/thread_info.h |  2 ++
 include/linux/sched.h              | 30 ++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)
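
For illustration, the intended usage is something like the sketch below.
This is not part of the patch; clear_pages_resched() and clear_pages() are
hypothetical names standing in for a long-running REP; STOSB style
primitive.

	/*
	 * Hypothetical caller: mark a long-running, interruptible hardware
	 * primitive as a section where irq exit may reschedule us.
	 */
	static void clear_pages_resched(void *addr, unsigned int npages)
	{
		allow_resched();
		clear_pages(addr, npages);	/* e.g. one long REP; STOSB */
		disallow_resched();
	}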

Comments

Peter Zijlstra Sept. 8, 2023, 7:02 a.m. UTC | #1
On Wed, Aug 30, 2023 at 11:49:56AM -0700, Ankur Arora wrote:

> +#ifdef TIF_RESCHED_ALLOW
> +/*
> + * allow_resched() .. disallow_resched() demarcate a preemptible section.
> + *
> + * Used around primitives where it might not be convenient to periodically
> + * call cond_resched().
> + */
> +static inline void allow_resched(void)
> +{
> +	might_sleep();
> +	set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);

So the might_sleep() ensures we don't currently have preemption
disabled; but there's nothing that ensures we don't do stupid things
like:

	allow_resched();
	spin_lock();
	...
	spin_unlock();
	disallow_resched();

Which on a PREEMPT_COUNT=n build will cause preemption while holding the
spinlock. I think something like the below will cause sufficient
warnings to avoid growing patterns like that.


Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -5834,6 +5834,13 @@ void preempt_count_add(int val)
 {
 #ifdef CONFIG_DEBUG_PREEMPT
 	/*
+	 * Disabling preemption under TIF_RESCHED_ALLOW doesn't
+	 * work for PREEMPT_COUNT=n builds.
+	 */
+	if (WARN_ON(resched_allowed()))
+		return;
+
+	/*
 	 * Underflow?
 	 */
 	if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
Linus Torvalds Sept. 8, 2023, 5:15 p.m. UTC | #2
On Fri, 8 Sept 2023 at 00:03, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Which on a PREEMPT_COUNT=n build will cause preemption while holding the
> spinlock. I think something like the below will cause sufficient
> warnings to avoid growing patterns like that.

Hmm. I don't think that warning is valid.

Disabling preemption is actually fine if it's done in an interrupt,
iow if we have

        allow_resched();
           -> irq happens
                spin_lock();  // Ok and should *not* complain
                ...
                spin_unlock();
            <- irq return (and preemption)

which actually makes me worry about the nested irq case, because this
would *not* be ok:

        allow_resched();
           -> irq happens
                -> *nested* irq happens
                <- nested irq return (and preemption)

ie the allow_resched() needs to still honor the irq count, and a
nested irq return obviously must not cause any preemption.

I've lost sight of the original patch series, and I assume / hope that
the above isn't actually an issue, but exactly because I've lost sight
of the original patches and only have this one in my mailbox I wanted
to check.

            Linus
Peter Zijlstra Sept. 8, 2023, 10:50 p.m. UTC | #3
On Fri, Sep 08, 2023 at 10:15:07AM -0700, Linus Torvalds wrote:
> On Fri, 8 Sept 2023 at 00:03, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Which on a PREEMPT_COUNT=n build will cause preemption while holding the
> > spinlock. I think something like the below will cause sufficient
> > warnings to avoid growing patterns like that.
> 
> Hmm. I don't think that warning is valid.
> 
> Disabling preemption is actually fine if it's done in an interrupt,
> iow if we have
> 
>         allow_resched();
>            -> irq happens
>                 spin_lock();  // Ok and should *not* complain
>                 ...
>                 spin_unlock();
>             <- irq return (and preemption)

Indeed.

> 
> which actually makes me worry about the nested irq case, because this
> would *not* be ok:
> 
>         allow_resched();
>            -> irq happens
>                 -> *nested* irq happens
>                 <- nested irq return (and preemption)
> 
> ie the allow_resched() needs to still honor the irq count, and a
> nested irq return obviously must not cause any preemption.

I think we killed nested interrupts a fair number of years ago, but I'll
recheck -- but not today, sleep is imminent.
Linus Torvalds Sept. 9, 2023, 5:15 a.m. UTC | #4
On Fri, 8 Sept 2023 at 15:50, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > which actually makes me worry about the nested irq case, because this
> > would *not* be ok:
> >
> >         allow_resched();
> >            -> irq happens
> >                 -> *nested* irq happens
> >                 <- nested irq return (and preemption)
> >
> > ie the allow_resched() needs to still honor the irq count, and a
> > nested irq return obviously must not cause any preemption.
>
> I think we killed nested interrupts a fair number of years ago, but I'll
> recheck -- but not today, sleep is imminent.

I don't think it has to be an interrupt. I think the TIF_ALLOW_RESCHED
thing needs to look out for any nested exception (ie only ever trigger
if it's returning to the kernel "task" stack).

Because I could easily see us wanting to do "I'm doing a big user
copy, it should do TIF_ALLOW_RESCHED, and I don't have preemption on",
and then instead of that first "irq happens", you have "page fault
happens" instead.

And inside that page fault handling you may well have critical
sections (like a spinlock) that is fine - but the fact that the
"process context" had TIF_ALLOW_RESCHED most certainly does *not* mean
that the page fault handler can reschedule.

Maybe it already does. As mentioned, I lost sight of the patch series,
even though I saw it originally (and liked it - only realizing on your
complaint that it might be more dangerous than I thought).

Basically, the "allow resched" should be a marker for a single context
level only. Kind of like a register state bit that gets saved on the
exception stack. Not a "anything happening within this process is now
preemptible".

I'm hoping Ankur will just pipe in and say "of course I already
implemented it that way, see XYZ".

              Linus
Ankur Arora Sept. 9, 2023, 5:30 a.m. UTC | #5
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, 8 Sept 2023 at 00:03, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> Which on a PREEMPT_COUNT=n build will cause preemption while holding the
>> spinlock. I think something like the below will cause sufficient
>> warnings to avoid growing patterns like that.
>
> Hmm. I don't think that warning is valid.
>
> Disabling preemption is actually fine if it's done in an interrupt,
> iow if we have
>
>         allow_resched();
>            -> irq happens
>                 spin_lock();  // Ok and should *not* complain
>                 ...
>                 spin_unlock();
>             <- irq return (and preemption)
>
> which actually makes me worry about the nested irq case, because this
> would *not* be ok:
>
>         allow_resched();
>            -> irq happens
>                 -> *nested* irq happens
>                 <- nested irq return (and preemption)
>
> ie the allow_resched() needs to still honor the irq count, and a
> nested irq return obviously must not cause any preemption.

IIUC, this should be equivalent to:

01         allow_resched();
02            -> irq happens
03               preempt_count_add(HARDIRQ_OFFSET);
04                -> nested irq happens
05                   preempt_count_add(HARDIRQ_OFFSET);
06
07                   preempt_count_sub(HARDIRQ_OFFSET);
08                 <- nested irq return
09               preempt_count_sub(HARDIRQ_OFFSET);

So, even if there were nested interrupts, then the !preempt_count()
check in raw_irqentry_exit_cond_resched() should ensure that no
preemption happens until after line 09.

> I've lost sight of the original patch series, and I assume / hope that
> the above isn't actually an issue, but exactly because I've lost sight
> of the original patches and only have this one in my mailbox I wanted
> to check.

Yeah, sorry about that. The irqentry_exit_allow_resched() is pretty much
this:

+void irqentry_exit_allow_resched(void)
+{
+	if (resched_allowed())
+		raw_irqentry_exit_cond_resched();
+}

So, as long as raw_irqentry_exit_cond_resched() won't allow early
preemption, having allow_resched() set shouldn't either.
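
For reference, raw_irqentry_exit_cond_resched() in kernel/entry/common.c
is roughly the following; the !preempt_count() check is what keeps the
nested cases above safe:

	void raw_irqentry_exit_cond_resched(void)
	{
		if (!preempt_count()) {
			/* Sanity check RCU and thread stack */
			rcu_irq_exit_check_preempt();
			if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
				WARN_ON_ONCE(!on_thread_stack());
			if (need_resched())
				preempt_schedule_irq();
		}
	}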

--
ankur
Ankur Arora Sept. 9, 2023, 6:39 a.m. UTC | #6
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, 8 Sept 2023 at 15:50, Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > which actually makes me worry about the nested irq case, because this
>> > would *not* be ok:
>> >
>> >         allow_resched();
>> >            -> irq happens
>> >                 -> *nested* irq happens
>> >                 <- nested irq return (and preemption)
>> >
>> > ie the allow_resched() needs to still honor the irq count, and a
>> > nested irq return obviously must not cause any preemption.
>>
>> I think we killed nested interrupts a fair number of years ago, but I'll
>> recheck -- but not today, sleep is imminent.
>
> I don't think it has to be an interrupt. I think the TIF_ALLOW_RESCHED
> thing needs to look out for any nested exception (ie only ever trigger
> if it's returning to the kernel "task" stack).
>
> Because I could easily see us wanting to do "I'm doing a big user
> copy, it should do TIF_ALLOW_RESCHED, and I don't have preemption on",
> and then instead of that first "irq happens", you have "page fault
> happens" instead.
>
> And inside that page fault handling you may well have critical
> sections (like a spinlock) that is fine - but the fact that the
> "process context" had TIF_ALLOW_RESCHED most certainly does *not* mean
> that the page fault handler can reschedule.
>
> Maybe it already does. As mentioned, I lost sight of the patch series,
> even though I saw it originally (and liked it - only realizing on your
> complaint that it might be more dangerous than I thought).
>
> Basically, the "allow resched" should be a marker for a single context
> level only. Kind of like a register state bit that gets saved on the
> exception stack. Not a "anything happening within this process is now
> preemptible".

Yeah, exactly. Though, not even a single context level, but a flag
attached to a single context at the process level only. Using
preempt_count() == 0 as the preemption boundary.

However, this has a problem with the PREEMPT_COUNT=n case because that
doesn't have a preemption boundary.

In the example that Peter gave:

        allow_resched();
        spin_lock();
            -> irq happens
            <- irq returns

            ---> preemption happens
        spin_unlock();
        disallow_resched();

So, here the !preempt_count() clause in raw_irqentry_exit_cond_resched()
won't protect us.

My thinking was to restrict allow_resched() to be used only around
primitive operations. But, I couldn't think of any way to enforce that.

I think the warning in preempt_count_add() as Peter suggested
upthread is a good idea. But, that's only for CONFIG_DEBUG_PREEMPT.


--
ankur
Peter Zijlstra Sept. 9, 2023, 9:11 a.m. UTC | #7
On Fri, Sep 08, 2023 at 11:39:47PM -0700, Ankur Arora wrote:

> Yeah, exactly. Though, not even a single context level, but a flag
> attached to a single context at the process level only. Using
> preempt_count() == 0 as the preemption boundary.
> 
> However, this has a problem with the PREEMPT_COUNT=n case because that
> doesn't have a preemption boundary.

So, with a little sleep, the nested exception/interrupt case should be
good, irqentry_enter() / irqentry_nmi_enter() unconditionally increment
preempt_count with HARDIRQ_OFFSET / NMI_OFFSET.

So while regular preempt_{dis,en}able() will turn into a NOP, the entry
code *will* continue to increment preempt_count.
Peter Zijlstra Sept. 9, 2023, 9:12 a.m. UTC | #8
On Fri, Sep 08, 2023 at 10:30:57PM -0700, Ankur Arora wrote:

> > which actually makes me worry about the nested irq case, because this
> > would *not* be ok:
> >
> >         allow_resched();
> >            -> irq happens
> >                 -> *nested* irq happens
> >                 <- nested irq return (and preemption)
> >
> > ie the allow_resched() needs to still honor the irq count, and a
> > nested irq return obviously must not cause any preemption.
> 
> IIUC, this should be equivalent to:
> 
> 01         allow_resched();
> 02            -> irq happens
> 03               preempt_count_add(HARDIRQ_OFFSET);
> 04                -> nested irq happens
> 05                   preempt_count_add(HARDIRQ_OFFSET);
> 06
> 07                   preempt_count_sub(HARDIRQ_OFFSET);
> 08                 <- nested irq return
> 09               preempt_count_sub(HARDIRQ_OFFSET);
> 
> So, even if there were nested interrupts, then the !preempt_count()
> check in raw_irqentry_exit_cond_resched() should ensure that no
> preemption happens until after line 09.

Yes, this.
Ankur Arora Sept. 9, 2023, 8:04 p.m. UTC | #9
Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Sep 08, 2023 at 11:39:47PM -0700, Ankur Arora wrote:
>
>> Yeah, exactly. Though, not even a single context level, but a flag
>> attached to a single context at the process level only. Using
>> preempt_count() == 0 as the preemption boundary.
>>
>> However, this has a problem with the PREEMPT_COUNT=n case because that
>> doesn't have a preemption boundary.
>
> So, with a little sleep, the nested exception/interrupt case should be
> good, irqentry_enter() / irqentry_nmi_enter() unconditionally increment
> preempt_count with HARDIRQ_OFFSET / NMI_OFFSET.
>
> So while regular preempt_{dis,en}able() will turn into a NOP, the entry
> code *will* continue to increment preempt_count.

Right, I was talking about the regular preempt_disable()/_enable() that
will turn into a NOP with PREEMPT_COUNT=n.

Actually, let me reply to the mail where you had described this case.

--
ankur
Ankur Arora Sept. 9, 2023, 8:15 p.m. UTC | #10
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Aug 30, 2023 at 11:49:56AM -0700, Ankur Arora wrote:
>
>> +#ifdef TIF_RESCHED_ALLOW
>> +/*
>> + * allow_resched() .. disallow_resched() demarcate a preemptible section.
>> + *
>> + * Used around primitives where it might not be convenient to periodically
>> + * call cond_resched().
>> + */
>> +static inline void allow_resched(void)
>> +{
>> +	might_sleep();
>> +	set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
>
> So the might_sleep() ensures we don't currently have preemption
> disabled; but there's nothing that ensures we don't do stupid things
> like:
>
> 	allow_resched();
> 	spin_lock();
> 	...
> 	spin_unlock();
> 	disallow_resched();
>
> Which on a PREEMPT_COUNT=n build will cause preemption while holding the
> spinlock. I think something like the below will cause sufficient
> warnings to avoid growing patterns like that.

Yeah, I agree this is a problem. I'll expand on the comment above
allow_resched() detailing this scenario.

> Index: linux-2.6/kernel/sched/core.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/core.c
> +++ linux-2.6/kernel/sched/core.c
> @@ -5834,6 +5834,13 @@ void preempt_count_add(int val)
>  {
>  #ifdef CONFIG_DEBUG_PREEMPT
>  	/*
> +	 * Disabling preemption under TIF_RESCHED_ALLOW doesn't
> +	 * work for PREEMPT_COUNT=n builds.
> +	 */
> +	if (WARN_ON(resched_allowed()))
> +		return;
> +
> +	/*
>  	 * Underflow?
>  	 */
>  	if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))

And, maybe something like this to guard against __this_cpu_read()
etc:

diff --git a/lib/smp_processor_id.c b/lib/smp_processor_id.c
index a2bb7738c373..634788f16e9e 100644
--- a/lib/smp_processor_id.c
+++ b/lib/smp_processor_id.c
@@ -13,6 +13,9 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2)
 {
        int this_cpu = raw_smp_processor_id();

+       if (unlikely(resched_allowed()))
+               goto out_error;
+
        if (likely(preempt_count()))
                goto out;

@@ -33,6 +36,7 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2)
        if (system_state < SYSTEM_SCHEDULING)
                goto out;

+out_error:
        /*
         * Avoid recursion:
         */

--
ankur
Linus Torvalds Sept. 9, 2023, 9:16 p.m. UTC | #11
On Sat, 9 Sept 2023 at 13:16, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> > +     if (WARN_ON(resched_allowed()))
> > +             return;
>
> And, maybe something like this to guard against __this_cpu_read()
> etc:
>
> +++ b/lib/smp_processor_id.c
> @@ -13,6 +13,9 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2)
>  {
>         int this_cpu = raw_smp_processor_id();
>
> +       if (unlikely(resched_allowed()))
> +               goto out_error;

Again, both of those checks are WRONG.

They'll error out even in exceptions / interrupts, when we have a
preempt count already from the exception itself.

So testing "resched_allowed()" that only tests the TIF_RESCHED_ALLOW
bit is wrong, wrong, wrong.

These situations aren't errors if we already had a preemption count
for other reasons. Only trying to disable preemption when in process
context (while TIF_RESCHED_ALLOW) is a problem. Your patch is missing
the check for "are we in a process context" part.

                Linus
Ankur Arora Sept. 10, 2023, 3:48 a.m. UTC | #12
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, 9 Sept 2023 at 13:16, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> > +     if (WARN_ON(resched_allowed()))
>> > +             return;
>>
>> And, maybe something like this to guard against __this_cpu_read()
>> etc:
>>
>> +++ b/lib/smp_processor_id.c
>> @@ -13,6 +13,9 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2)
>>  {
>>         int this_cpu = raw_smp_processor_id();
>>
>> +       if (unlikely(resched_allowed()))
>> +               goto out_error;
>
> Again, both of those checks are WRONG.
>
> They'll error out even in exceptions / interrupts, when we have a
> preempt count already from the exception itself.
>
> So testing "resched_allowed()" that only tests the TIF_RESCHED_ALLOW
> bit is wrong, wrong, wrong.

Yeah, you are right.

I think we can keep these checks, but with this fixed definition of
resched_allowed(). This might be better:

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2260,7 +2260,8 @@ static inline void disallow_resched(void)

 static __always_inline bool resched_allowed(void)
 {
-       return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
+       return unlikely(!preempt_count() &&
+                        test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
 }

Ankur

> These situations aren't errors if we already had a preemption count
> for other reasons. Only trying to disable preemption when in process
> context (while TIF_RESCHED_ALLOW) is a problem. Your patch is missing
> the check for "are we in a process context" part.
>
>                 Linus
Linus Torvalds Sept. 10, 2023, 4:35 a.m. UTC | #13
On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> I think we can keep these checks, but with this fixed definition of
> resched_allowed(). This might be better:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void)
>
>  static __always_inline bool resched_allowed(void)
>  {
> -       return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
> +       return unlikely(!preempt_count() &&
> +                        test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>  }

I'm not convinced (at all) that the preempt count is the right thing.

It works for interrupts, yes, because interrupts will increment the
preempt count even on non-preempt kernels (since the preempt count is
also the interrupt context level).

But what about any synchronous trap handling?

In other words, just something like a page fault? A page fault doesn't
increment the preemption count (and in fact many page faults _will_
obviously re-schedule as part of waiting for IO).

A page fault can *itself* say "feel free to preempt me", and that's one thing.

But a page fault can also *interupt* something that said "feel free to
preempt me", and that's a completely *different* thing.

So I feel like the "tsk_thread_flag" was sadly completely the wrong
place to add this bit to, and the wrong place to test it in. What we
really want is "current kernel entry context".

So the right thing to do would basically be to put it in the stack
frame at kernel entry - whether that kernel entry was a system call
(which is doing some big copy that should be preemptible without us
having to add "cond_resched()" in places), or is a page fault (which
will also do things like big page clearings for hugepages)

And we don't have that, do we?

We have "on_thread_stack()", which checks for "are we on the system
call stack". But that doesn't work for page faults.

PeterZ - I feel like I might be either very confused, or missing
something. You probably go "Duh, Linus, you're off on one of your crazy
tangents, and none of this is relevant because..."

                 Linus
Ankur Arora Sept. 10, 2023, 10:01 a.m. UTC | #14
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> I think we can keep these checks, but with this fixed definition of
>> resched_allowed(). This might be better:
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void)
>>
>>  static __always_inline bool resched_allowed(void)
>>  {
>> -       return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>> +       return unlikely(!preempt_count() &&
>> +                        test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>>  }
>
> I'm not convinced (at all) that the preempt count is the right thing.
>
> It works for interrupts, yes, because interrupts will increment the
> preempt count even on non-preempt kernels (since the preempt count is
> also the interrupt context level).
>
> But what about any synchronous trap handling?
>
> In other words, just something like a page fault? A page fault doesn't
> increment the preemption count (and in fact many page faults _will_
> obviously re-schedule as part of waiting for IO).
>
> A page fault can *itself* say "feel free to preempt me", and that's one thing.
>
> But a page fault can also *interupt* something that said "feel free to
> preempt me", and that's a completely *different* thing.
>
> So I feel like the "tsk_thread_flag" was sadly completely the wrong
> place to add this bit to, and the wrong place to test it in. What we
> really want is "current kernel entry context".

So, what we want allow_resched() to say is: feel free to reschedule
if in a reschedulable context.

The problem with doing that with an allow_resched tsk_thread_flag is
that the flag is really only valid while it is executing in the context
it was set.
And, trying to validate the flag by checking the preempt_count() makes
it pretty fragile, given that now we are tying it with the specifics of
whether the handling of arbitrary interrupts bumps up the
preempt_count() or not.

> So the right thing to do would basically be to put it in the stack
> frame at kernel entry - whether that kernel entry was a system call
> (which is doing some big copy that should be preemptible without us
> having to add "cond_resched()" in places), or is a page fault (which
> will also do things like big page clearings for hugepages)

Seems to me that associating an allow_resched flag with the stack also
has similar issue. Couldn't the context level change while we are on the
same stack?

I guess the problem is that allow_resched()/disallow_resched() really
need to demarcate a section of code having some property, but instead
set up state that has much wider scope.

Maybe code that allows resched can be in a new .section ".text.resched"
or whatever, and we could use something like this as a check:

  int resched_allowed(regs) {
        return !preempt_count() && in_resched_function(regs->rip);
  }

(allow_resched()/disallow_resched() shouldn't be needed except for
debug checks.)
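
A minimal sketch of what in_resched_function() could look like for this
single-section variant, assuming the linker script exports the section
bounds (none of these symbols exist in the posted series):

	/* Hypothetical bounds of ".text.resched", provided by the linker. */
	extern char __resched_text_start[], __resched_text_end[];

	static inline bool in_resched_function(unsigned long ip)
	{
		return ip >= (unsigned long)__resched_text_start &&
		       ip <  (unsigned long)__resched_text_end;
	}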

We still need the !preempt_count() check, but now both the conditions
in the test express two orthogonal ideas:

  - !preempt_count(): preemption is safe in the current context
  - in_resched_function(regs->rip): okay to reschedule here

So in this example, it should allow scheduling inside both the
clear_pages_reschedulable() calls:

  -> page_fault()
     clear_page_reschedulable();
     -> page_fault()
        clear_page_reschedulable();

Here though, rescheduling could happen only in the first call to
clear_page_reschedulable():

  -> page_fault()
     clear_page_reschedulable();
     -> hardirq()
         -> page_fault()
            clear_page_reschedulable();

Does that make any sense, or am I just talking through my hat?

--
ankur
Linus Torvalds Sept. 10, 2023, 6:32 p.m. UTC | #15
On Sun, 10 Sept 2023 at 03:01, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> Seems to me that associating an allow_resched flag with the stack also
> has similar issue. Couldn't the context level change while we are on the
> same stack?

On x86-64 no, but in other situations yes.

> I guess the problem is that allow_resched()/disallow_resched() really
> need to demarcate a section of code having some property, but instead
> set up state that has much wider scope.
>
> Maybe code that allows resched can be in a new .section ".text.resched"
> or whatever, and we could use something like this as a check:

Yes. I'm starting to think that the only sane solution is to
limit cases that can do this a lot, and the "instruction pointer
region" approach would certainly work.

At the same time I really hate that, because I was hoping we'd be able
to use this to not have so many of those annoying and random
"cond_resched()" calls.

I literally merged another batch of "random stupid crap" an hour ago:
commit 3d7d72a34e05 ("x86/sgx: Break up long non-preemptible delays in
sgx_vepc_release()") literally just adds manual 'cond_resched()' calls
in random places.

I was hoping that we'd have some generic way to deal with this where
we could just say "this thing is reschedulable", and get rid of - or
at least not increasingly add to - the cond_resched() mess.

Of course, that was probably always unrealistic, and those random
cond_resched() calls we just added probably couldn't just be replaced
by "you can reschedule me" simply because the functions quite possibly
end up taking some lock hidden in one of the xa_xyz() functions.

For the _particular_ case of "give me a preemptible big memory copy /
clear", the section model seems fine. It's just that we do have quite
a bit of code where we can end up with long loops that want that
cond_resched() too that I was hoping we'd _also_ be able to solve.

                   Linus
Peter Zijlstra Sept. 11, 2023, 3:04 p.m. UTC | #16
On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote:

> I was hoping that we'd have some generic way to deal with this where
> we could just say "this thing is reschedulable", and get rid of - or
> at least not increasingly add to - the cond_resched() mess.

Isn't that called PREEMPT=y ? That tracks precisely all the constraints
required to know when/if we can preempt.

The whole voluntary preempt model is basically the traditional
co-operative preemption model and that fully relies on manual yields.

The problem with the REP prefix (and Xen hypercalls) is that
they're long running instructions and it becomes fundamentally
impossible to put a cond_resched() in.

>> Yes. I'm starting to think that the only sane solution is to
>> limit cases that can do this a lot, and the "instruction pointer
> region" approach would certainly work.

From a code locality / I-cache POV, I think a sorted list of
(non overlapping) ranges might be best.
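
As a sketch, a lookup over such a sorted table could look like this
(struct resched_range and the table itself are assumptions, not code from
any posted patch):

	/* Hypothetical sorted table of non-overlapping [start, end) ranges. */
	struct resched_range {
		unsigned long start;
		unsigned long end;
	};

	extern const struct resched_range resched_ranges[];
	extern const unsigned int nr_resched_ranges;

	static bool ip_in_resched_range(unsigned long ip)
	{
		unsigned int lo = 0, hi = nr_resched_ranges;

		while (lo < hi) {
			unsigned int mid = lo + (hi - lo) / 2;

			if (ip < resched_ranges[mid].start)
				hi = mid;
			else if (ip >= resched_ranges[mid].end)
				lo = mid + 1;
			else
				return true;
		}
		return false;
	}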
Andrew Cooper Sept. 11, 2023, 4:29 p.m. UTC | #17
On 11/09/2023 4:04 pm, Peter Zijlstra wrote:
> On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote:
>
>> I was hoping that we'd have some generic way to deal with this where
>> we could just say "this thing is reschedulable", and get rid of - or
>> at least not increasingly add to - the cond_resched() mess.
> Isn't that called PREEMPT=y ? That tracks precisely all the constraints
> required to know when/if we can preempt.
>
> The whole voluntary preempt model is basically the traditional
> co-operative preemption model and that fully relies on manual yields.
>
> The problem with the REP prefix (and Xen hypercalls) is that
> they're long running instructions and it becomes fundamentally
> impossible to put a cond_resched() in.

Any VMM - Xen isn't special here.

And if we're talking about instructions, then CPUID, GETSEC and
ENCL{S,U} and plenty of {RD,WR}MSRs in in a similar category, being
effectively blocking RPC operations to something else in the platform.

The Xen evtchn upcall logic in Linux does cond_resched() when possible. 
i.e. long-running hypercalls issued with interrupts enabled can
reschedule if an interrupt occurs, which is pretty close to how REP
works too.

~Andrew
Steven Rostedt Sept. 11, 2023, 4:48 p.m. UTC | #18
On Sat, 9 Sep 2023 21:35:54 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > I think we can keep these checks, but with this fixed definition of
> > resched_allowed(). This might be better:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void)
> >
> >  static __always_inline bool resched_allowed(void)
> >  {
> > -       return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
> > +       return unlikely(!preempt_count() &&
> > +                        test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
> >  }  
> 
> I'm not convinced (at all) that the preempt count is the right thing.
> 
> It works for interrupts, yes, because interrupts will increment the
> preempt count even on non-preempt kernels (since the preempt count is
> also the interrupt context level).
> 
> But what about any synchronous trap handling?
> 
> In other words, just something like a page fault? A page fault doesn't
> increment the preemption count (and in fact many page faults _will_
> obviously re-schedule as part of waiting for IO).

I wonder if we should make it a rule to not allow page faults when
RESCHED_ALLOW is set? Yeah, we can preempt in page faults, but that's not
what the allow_resched() is about. Since the main purpose of that function,
according to the change log, is for kernel threads. Do kernel threads page
fault? (perhaps for vmalloc? but do we take locks in those cases?).

That is, treat allow_resched() like preempt_disable(). If we page fault
with "preempt_disable()" we usually complain about that (unless we do some
magic with *_nofault() functions).

Then we could just add checks in the page fault handlers to see if
allow_resched() is set, and if so, complain about it like we do with
preempt_disable in the might_fault() function.
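
A hypothetical form of such a check, modeled on the might_fault() style of
debug warning (resched_allowed() as defined in this series):

	/*
	 * Hypothetical debug check for the kernel page-fault path: treat a
	 * fault under TIF_RESCHED_ALLOW like a fault with preemption
	 * disabled and warn about it.
	 */
	static inline void check_fault_under_resched_allow(void)
	{
		if (IS_ENABLED(CONFIG_DEBUG_PREEMPT))
			WARN_ON_ONCE(resched_allowed());
	}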


-- Steve
Ankur Arora Sept. 11, 2023, 5:04 p.m. UTC | #19
Peter Zijlstra <peterz@infradead.org> writes:

> On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote:
>
>> I was hoping that we'd have some generic way to deal with this where
>> we could just say "this thing is reschedulable", and get rid of - or
>> at least not increasingly add to - the cond_resched() mess.
>
> Isn't that called PREEMPT=y ? That tracks precisely all the constraints
> required to know when/if we can preempt.
>
> The whole voluntary preempt model is basically the traditional
> co-operative preemption model and that fully relies on manual yields.

Yeah, but as Linus says, this means a lot of code is just full of
cond_resched(). For instance, a loop in process_huge_page() uses
this pattern:

   for (...) {
       cond_resched();
       clear_page(i);

       cond_resched();
       clear_page(j);
   }

> The problem with the REP prefix (and Xen hypercalls) is that
> they're long running instructions and it becomes fundamentally
> impossible to put a cond_resched() in.
>
>> Yes. I'm starting to think that the only sane solution is to
>> limit cases that can do this a lot, and the "instruction pointer
>> region" approach would certainly work.
>
> From a code locality / I-cache POV, I think a sorted list of
> (non overlapping) ranges might be best.

Yeah, agreed. There are a few problems with doing that though.

I was thinking of using a check of this kind to schedule out when
it is executing in this "reschedulable" section:
        !preempt_count() && in_resched_function(regs->rip);

For preemption=full, this should mostly work.
For preemption=voluntary, though this'll only work with out-of-line
locks, not if the lock is inlined.

(Both, should have problems with __this_cpu_* and the like, but
maybe we can handwave that away with sparse/objtool etc.)

How expensive would be always having PREEMPT_COUNT=y?

--
ankur
Linus Torvalds Sept. 11, 2023, 8:50 p.m. UTC | #20
On Mon, 11 Sept 2023 at 09:48, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I wonder if we should make it a rule to not allow page faults when
> RESCHED_ALLOW is set?

I really think that user copies might actually be one of the prime targets.

Right now we special-case big user copies - see for example
copy_chunked_from_user().

But that's an example of exactly the problem this code has - we
literally make more complex - and objectively *WORSE* - code just to
deal with "I want this to be interruptible".

So yes, we could limit RESCHED_ALLOW to not allow page faults, but big
user copies literally are one of the worst problems.

Another example of this this is just plain read/write. It's not a
problem in practice right now, because large pages are effectively
never used.

But just imagine what happens once filemap_read() actually does big folios?

Do you really want this code:

        copied = copy_folio_to_iter(folio, offset, bytes, iter);

to forever use the artificial chunking it does now?

And yes, right now it will still do things in one-page chunks in
copy_page_to_iter(). It doesn't even have cond_resched() - it's
currently in the caller, in filemap_read().

But just think about possible futures.

Now, one option really is to do what I think PeterZ kind of alluded to
- start deprecating PREEMPT_VOLUNTARY and PREEMPT_NONE entirely.

Except we've actually been *adding* to this whole mess, rather than
removing it. So we have actively *expanded* on that preemption choice
with PREEMPT_DYNAMIC.

That's actually reasonably recent, implying that distros really want
to still have the option.

And it seems like it's actually server people who want the "no
preemption" (and presumably avoid all the preempt count stuff entirely
- it's not necessarily the *preemption* that is the cost, it's the
incessant preempt count updates)

                            Linus
Linus Torvalds Sept. 11, 2023, 9:16 p.m. UTC | #21
On Mon, 11 Sept 2023 at 13:50, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Except we've actually been *adding* to this whole mess, rather than
> removing it. So we have actively *expanded* on that preemption choice
> with PREEMPT_DYNAMIC.

Actually, that config option makes no sense.

It makes the cond_resched() behavior conditional with a static call.

But all the *real* overhead is still there and unconditional (ie all
the preempt count updates and the "did it go down to zero and we need
to check" code).

That just seems stupid. It seems to have all the overhead of a
preemptible kernel, just not doing the preemption.

So I must be mis-reading this, or just missing something important.

The real cost seems to be

   PREEMPT_BUILD -> PREEMPTION -> PREEMPT_COUNT

and PREEMPT vs PREEMPT_DYNAMIC makes no difference to that, since both
will end up with that, and thus both cases will have all the spinlock
preempt count stuff.

There must be some non-preempt_count cost that people worry about.

Or maybe I'm just mis-reading the Kconfig stuff entirely. That's
possible, because this seems *so* pointless to me.

Somebody please hit me with a clue-bat to the noggin.

                Linus
Steven Rostedt Sept. 11, 2023, 10:20 p.m. UTC | #22
On Mon, 11 Sep 2023 13:50:53 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And it seems like it's actually server people who want the "no
> preemption" (and presumably avoid all the preempt count stuff entirely
> - it's not necessarily the *preemption* that is the cost, it's the
> incessant preempt count updates)

I'm sure there's some overhead with the preemption itself. With the
meltdown/spectre mitigations, going into and out of the kernel does add some
more overhead. And finishing a system call before being preempted may give
some performance benefits for some micro benchmark out there.

Going out on a crazy idea, I wonder if we could get the compiler to help us
here. As all preempt disabled locations are static, and as for functions,
they can be called with preemption enabled or disabled. Would it be
possible for the compiler to mark all locations that need preemption disabled?

If a function is called in a preempt disabled section and also called in a
preempt enable section, it could make two versions of the function (one
where preemption is disabled and one where it is enabled).

Then all we would need is a look up table to know if preemption is safe or
not by looking at the instruction pointer.

Yes, I know this is kind of a wild idea, but I do believe it is possible.

The compiler wouldn't need to know of the concept of "preemption" just a
"make this location special, and keep functions called by that location
special and duplicate them if they are are called outside of this special
section".

 ;-)

-- Steve
Ankur Arora Sept. 11, 2023, 11:10 p.m. UTC | #23
Steven Rostedt <rostedt@goodmis.org> writes:

> On Mon, 11 Sep 2023 13:50:53 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> And it seems like it's actually server people who want the "no
>> preemption" (and presumably avoid all the preempt count stuff entirely
>> - it's not necessarily the *preemption* that is the cost, it's the
>> incessant preempt count updates)
>
> I'm sure there's some overhead with the preemption itself. With the
> meltdown/spectre mitigations going into and out of the kernel does add some
> more overhead. And finishing a system call before being preempted may give
> some performance benefits for some micro benchmark out there.
>
> Going out on a crazy idea, I wonder if we could get the compiler to help us
> here. As all preempt disabled locations are static, and as for functions,
> they can be called with preemption enabled or disabled. Would it be
> possible for the compiler to mark all locations that need preemption disabled?

An even crazier version of that idea would be to have
preempt_disable/enable() demarcate regions, and have the compiler put all
of the preemption-disabled regions out-of-line in a special section.
Seems to me that then we could do away with preempt_enable/disable()?
(Ignoring the preempt_count used in hardirq etc.)

This would allow preemption always, unless executing in the
preemption-disabled section.

Though I don't have any intuition for how much extra call overhead this
would add.

Ankur

> If a function is called in a preempt disabled section and also called in a
> preempt enable section, it could make two versions of the function (one
> where preemption is disabled and one where it is enabled).
>
> Then all we would need is a look up table to know if preemption is safe or
> not by looking at the instruction pointer.
>
> Yes, I know this is kind of a wild idea, but I do believe it is possible.
>
> The compiler wouldn't need to know of the concept of "preemption" just a
> "make this location special, and keep functions called by that location
> special and duplicate them if they are called outside of this special
> section".
>
>  ;-)
>
> -- Steve
Steven Rostedt Sept. 11, 2023, 11:16 p.m. UTC | #24
On Mon, 11 Sep 2023 16:10:31 -0700
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> An even crazier version of that idea would be to have
> preempt_disable/enable() demarcate regions, and have the compiler put all
> of the preemption-disabled regions out-of-line in a special section.
> Seems to me that then we could do away with preempt_enable/disable()?
> (Ignoring the preempt_count used in hardirq etc.)
> 

I thought about this too, but wasn't sure if it would be easier or harder
to implement. This would still require the duplicate functions (which I
guess would be the most difficult part).

> This would allow preemption always, unless executing in the
> preemption-disabled section.
> 
> Though I don't have any intuition for how much extra call overhead this
> would add.

I don't think this version would have as high of an overhead. You would get
a direct jump (which isn't bad as all speculation knows exactly where to
look), and it would improve the look up. No table, just a simple range
check.

-- Steve
Matthew Wilcox Sept. 12, 2023, 3:27 a.m. UTC | #25
On Mon, Sep 11, 2023 at 01:50:53PM -0700, Linus Torvalds wrote:
> Another example of this this is just plain read/write. It's not a
> problem in practice right now, because large pages are effectively
> never used.
> 
> But just imagine what happens once filemap_read() actually does big folios?
> 
> Do you really want this code:
> 
>         copied = copy_folio_to_iter(folio, offset, bytes, iter);
> 
> to forever use the artificial chunking it does now?
> 
> And yes, right now it will still do things in one-page chunks in
> copy_page_to_iter(). It doesn't even have cond_resched() - it's
> currently in the caller, in filemap_read().

Ah, um.  If you take a look in fs/iomap/buffered-io.c, you'll
see ...

iomap_write_iter:
        size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
                struct folio *folio;
                bytes = min(chunk - offset, iov_iter_count(i));
                if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
                copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);

So we do still cond_resched(), but we might go up to PMD_SIZE
between calls.  This is new code in 6.6 so it hasn't seen use by too
many users yet, but it's certainly bigger than the 16 pages used by
copy_chunked_from_user().  I honestly hadn't thought about preemption
latency.
Peter Zijlstra Sept. 12, 2023, 7:20 a.m. UTC | #26
On Mon, Sep 11, 2023 at 02:16:18PM -0700, Linus Torvalds wrote:
> On Mon, 11 Sept 2023 at 13:50, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Except we've actually been *adding* to this whole mess, rather than
> > removing it. So we have actively *expanded* on that preemption choice
> > with PREEMPT_DYNAMIC.
> 
> Actually, that config option makes no sense.
> 
> It makes the cond_resched() behavior conditional with a static call.
> 
> But all the *real* overhead is still there and unconditional (ie all
> the preempt count updates and the "did it go down to zero and we need
> to check" code).
> 
> That just seems stupid. It seems to have all the overhead of a
> preemptible kernel, just not doing the preemption.
> 
> So I must be mis-reading this, or just missing something important.
> 
> The real cost seems to be
> 
>    PREEMPT_BUILD -> PREEMPTION -> PREEMPT_COUNT
> 
> and PREEMPT vs PREEMPT_DYNAMIC makes no difference to that, since both
> will end up with that, and thus both cases will have all the spinlock
> preempt count stuff.
> 
> There must be some non-preempt_count cost that people worry about.
> 
> Or maybe I'm just mis-reading the Kconfig stuff entirely. That's
> possible, because this seems *so* pointless to me.
> 
> Somebody please hit me with a clue-bat to the noggin.

Well, I was about to reply to your previous email explaining this, but
this one time I did read more email..

Yes, PREEMPT_DYNAMIC has all the preempt count twiddling and only nops
out the schedule()/cond_resched() calls where appropriate.

This work was done by a distro (SuSE) and if they're willing to ship
this I'm thinking the overheads are acceptable to them.

For a significant number of workloads the real overhead is the extra
preemptions themselves more than the counting -- but yes, the counting
is measurable, but probably in the noise compared to some of the
other horrible things we have done the past years.

Anyway, if distros are fine shipping with PREEMPT_DYNAMIC, then yes,
deleting the other options is definitely an option.
Ingo Molnar Sept. 12, 2023, 7:38 a.m. UTC | #27
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Sep 11, 2023 at 02:16:18PM -0700, Linus Torvalds wrote:
> > On Mon, 11 Sept 2023 at 13:50, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > Except we've actually been *adding* to this whole mess, rather than
> > > removing it. So we have actively *expanded* on that preemption choice
> > > with PREEMPT_DYNAMIC.
> > 
> > Actually, that config option makes no sense.
> > 
> > It makes the cond_resched() behavior conditional with a static call.
> > 
> > But all the *real* overhead is still there and unconditional (ie all
> > the preempt count updates and the "did it go down to zero and we need
> > to check" code).
> > 
> > That just seems stupid. It seems to have all the overhead of a
> > preemptible kernel, just not doing the preemption.
> > 
> > So I must be mis-reading this, or just missing something important.
> > 
> > The real cost seems to be
> > 
> >    PREEMPT_BUILD -> PREEMPTION -> PREEMPT_COUNT
> > 
> > and PREEMPT vs PREEMPT_DYNAMIC makes no difference to that, since both
> > will end up with that, and thus both cases will have all the spinlock
> > preempt count stuff.
> > 
> > There must be some non-preempt_count cost that people worry about.
> > 
> > Or maybe I'm just mis-reading the Kconfig stuff entirely. That's
> > possible, because this seems *so* pointless to me.
> > 
> > Somebody please hit me with a clue-bat to the noggin.
> 
> Well, I was about to reply to your previous email explaining this, but 
> this one time I did read more email..
> 
> Yes, PREEMPT_DYNAMIC has all the preempt count twiddling and only nops 
> out the schedule()/cond_resched() calls where appropriate.
> 
> This work was done by a distro (SuSE) and if they're willing to ship this 
> I'm thinking the overheads are acceptable to them.
> 
> For a significant number of workloads the real overhead is the extra
> preemptions themselves more than the counting -- but yes, the counting is
> measurable, but probably in the noise compared to some of the other
> horrible things we have done the past years.
> 
> Anyway, if distros are fine shipping with PREEMPT_DYNAMIC, then yes, 
> deleting the other options is definitely an option.

Yes, so my understanding is that distros generally worry more about 
macro-overhead, for example material changes to a random subset of key 
benchmarks that specific enterprise customers care about, and distros are 
not nearly as sensitive about micro-overhead that preempt_count() 
maintenance causes.

PREEMPT_DYNAMIC is basically a reflection of that: the desire to have only 
a single kernel image, but a boot-time toggle to differentiate between 
desktop and server loads and have CONFIG_PREEMPT (desktop) but also 
PREEMPT_VOLUNTARY behavior (server).

There's also the view that PREEMPT kernels are a bit more QA-friendly, 
because atomic code sequences are much better defined & enforced via kernel 
warnings. Without preempt_count we only have irqs-off warnings, which cover
only a small fraction of all critical sections in the kernel.

Ideally we'd be able to patch out most of the preempt_count maintenance 
overhead too - OTOH these days it's little more than noise on most CPUs, 
considering the kind of horrible security-workaround overhead we have on 
almost all x86 CPU types ... :-/

Thanks,

	Ingo
Peter Zijlstra Sept. 12, 2023, 8:26 a.m. UTC | #28
On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
> 
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote:
> >
> >> I was hoping that we'd have some generic way to deal with this where
> >> we could just say "this thing is reschedulable", and get rid of - or
> >> at least not increasingly add to - the cond_resched() mess.
> >
> > Isn't that called PREEMPT=y ? That tracks precisely all the constraints
> > required to know when/if we can preempt.
> >
> > The whole voluntary preempt model is basically the traditional
> > co-operative preemption model and that fully relies on manual yields.
> 
> Yeah, but as Linus says, this means a lot of code is just full of
> cond_resched(). For instance, a loop in process_huge_page() uses
> this pattern:
> 
>    for (...) {
>        cond_resched();
>        clear_page(i);
> 
>        cond_resched();
>        clear_page(j);
>    }

Yeah, that's what co-operative preemption gets you.

> > The problem with the REP prefix (and Xen hypercalls) is that
> > they're long running instructions and it becomes fundamentally
> > impossible to put a cond_resched() in.
> >
> >> Yes. I'm starting to think that the only sane solution is to
> >> limit cases that can do this a lot, and the "instruction pointer
> >> region" approach would certainly work.
> >
> > From a code locality / I-cache POV, I think a sorted list of
> > (non overlapping) ranges might be best.
> 
> Yeah, agreed. There are a few problems with doing that though.
> 
> I was thinking of using a check of this kind to schedule out when
> it is executing in this "reschedulable" section:
>         !preempt_count() && in_resched_function(regs->rip);
> 
> For preemption=full, this should mostly work.
> For preemption=voluntary, though this'll only work with out-of-line
> locks, not if the lock is inlined.
> 
> (Both, should have problems with __this_cpu_* and the like, but
> maybe we can handwave that away with sparse/objtool etc.)

So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
thing, and then only search the range when TIF flag is set.

And I'm thinking it might be a good idea to have objtool validate the
range only contains simple instructions, the moment it contains control
flow I'm thinking it's too complicated.
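
A rough sketch of that combination (names assumed; ip_in_resched_range()
is the hypothetical range lookup discussed earlier in the thread):

	/* Only consult the (hypothetical) range table when the task has
	 * opted in via TIF_RESCHED_ALLOW and preemption is otherwise legal. */
	static bool irqexit_resched_allowed(struct pt_regs *regs)
	{
		if (!test_tsk_thread_flag(current, TIF_RESCHED_ALLOW))
			return false;

		return !preempt_count() && ip_in_resched_range(regs->ip);
	}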

> How expensive would be always having PREEMPT_COUNT=y?

Effectively I think that is true today. At the very least Debian and
SuSE (I can't find a RHEL .config in a hurry but I would think they too)
ship with PREEMPT_DYNAMIC=y.

Mel, I'm sure you ran numbers at the time (you always do), what if any
was the measured overhead from PREEMPT_DYNAMIC vs 'regular' voluntary
preemption?
Phil Auld Sept. 12, 2023, 12:24 p.m. UTC | #29
On Tue, Sep 12, 2023 at 10:26:06AM +0200 Peter Zijlstra wrote:
> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
> > 
> > How expensive would be always having PREEMPT_COUNT=y?
> 
> Effectively I think that is true today. At the very least Debian and
> SuSE (I can't find a RHEL .config in a hurry but I would think they too)
> ship with PREEMPT_DYNAMIC=y.
>

Yes, RHEL too.

Cheers,
Phil

--
Matthew Wilcox Sept. 12, 2023, 12:33 p.m. UTC | #30
On Tue, Sep 12, 2023 at 10:26:06AM +0200, Peter Zijlstra wrote:
> > How expensive would be always having PREEMPT_COUNT=y?
> 
> Effectively I think that is true today. At the very least Debian and
> SuSE (I can't find a RHEL .config in a hurry but I would think they too)
> ship with PREEMPT_DYNAMIC=y.

$ grep PREEMPT uek-rpm/ol9/config-x86_64
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

$ grep PREEMPT uek-rpm/ol9/config-aarch64
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
Linus Torvalds Sept. 12, 2023, 4:20 p.m. UTC | #31
On Mon, 11 Sept 2023 at 20:27, Matthew Wilcox <willy@infradead.org> wrote:
>
> So we do still cond_resched(), but we might go up to PMD_SIZE
> between calls.  This is new code in 6.6 so it hasn't seen use by too
> many users yet, but it's certainly bigger than the 16 pages used by
> copy_chunked_from_user().  I honestly hadn't thought about preemption
> latency.

The thing about cond_resched() is that you literally won't get anybody
who complains until the big page case is common enough that it hits
special people.

This is also a large part of why I dislike cond_resched() a lot. It's
not just that it's sprinkled randomly in our code-base, it's that it's
*found* and added so randomly.

Some developers will look at code and say "this may be a long loop"
and add it without any numbers. It's rare, but it happens.

And other than that it usually is something like the RT people who
have the latency trackers, and one particular load that they use for
testing.

Oh well. Enough kvetching. I'm not happy about it, but in the end it's
a small annoyance, not a big issue.

                Linus
Linus Torvalds Sept. 12, 2023, 4:30 p.m. UTC | #32
On Mon, 11 Sept 2023 at 15:20, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Going out on a crazy idea, I wonder if we could get the compiler to help us
> here. As all preempt disabled locations are static, and as for functions,
> they can be called with preemption enabled or disabled. Would it be
> possible for the compiler to mark all locations that need preemption disabled?

Definitely not.

Those preempt-disabled areas aren't static, for one thing. Any time
you take any exception in kernel space, your exception handler is
dynamically preemptible or not (possibly depending on architecture
details).

Yes, most exception handlers then have magic rules: page faults won't
get past a particular point if they happened while not preemptible,
for example. And interrupts will disable preemption themselves.

But we have a ton of code that runs lots of subtle code in exception
handlers that is very architecture-dependent, whether it is things
like unaligned fixups, or instruction rewriting things for dynamic
calls, or a lot of very grotty code.

Most (all?) of it could probably be made to be non-preemptible, but
it's a lot of code for a lot of architectures, and it's not the
trivial kind.

And that's ignoring all the code that is run in just regular process
context with no exceptions that is sometimes run under spinlocks, and
sometimes not. There's a *lot* of it. Think something as trivial as
memcpy(), but also kmalloc() or any number of stuff that is just
random support code that can be used from atomic (non-preemtible)
context.

So even if we could rely on magic compiler support that doesn't exist
- which we can't - it's not even *remotely* as static as you seem to
think it is.

                 Linus
Thomas Gleixner Sept. 18, 2023, 11:42 p.m. UTC | #33
On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote:
> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>> > The problem with the REP prefix (and Xen hypercalls) is that
>> > they're long running instructions and it becomes fundamentally
>> > impossible to put a cond_resched() in.
>> >
>> >> Yes. I'm starting to think that the only sane solution is to
>> >> limit cases that can do this a lot, and the "instruction pointer
>> >> region" approach would certainly work.
>> >
>> > From a code locality / I-cache POV, I think a sorted list of
>> > (non overlapping) ranges might be best.
>> 
>> Yeah, agreed. There are a few problems with doing that though.
>> 
>> I was thinking of using a check of this kind to schedule out when
>> it is executing in this "reschedulable" section:
>>         !preempt_count() && in_resched_function(regs->rip);
>> 
>> For preemption=full, this should mostly work.
>> For preemption=voluntary, though this'll only work with out-of-line
>> locks, not if the lock is inlined.
>> 
>> (Both, should have problems with __this_cpu_* and the like, but
>> maybe we can handwave that away with sparse/objtool etc.)
>
> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
> thing, and then only search the range when TIF flag is set.
>
> And I'm thinking it might be a good idea to have objtool validate the
> range only contains simple instructions, the moment it contains control
> flow I'm thinking it's too complicated.
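
A rough sketch of that combined TIF-flag + sorted-range check, purely for
illustration and with made-up table and helper names (not code from this
series):

        /*
         * Sorted, non-overlapping kernel text ranges in which rescheduling
         * from the interrupt return path is considered safe.  In the idea
         * above, the table would be generated and sanity checked by objtool.
         */
        struct resched_range {
                unsigned long start;
                unsigned long end;
        };

        extern const struct resched_range resched_ranges[];
        extern const unsigned int nr_resched_ranges;

        static bool ip_in_resched_range(unsigned long ip)
        {
                unsigned int lo = 0, hi = nr_resched_ranges;

                /* Binary search keeps the I-cache and runtime cost small. */
                while (lo < hi) {
                        unsigned int mid = lo + (hi - lo) / 2;

                        if (ip < resched_ranges[mid].start)
                                hi = mid;
                        else if (ip >= resched_ranges[mid].end)
                                lo = mid + 1;
                        else
                                return true;
                }
                return false;
        }

        /* Interrupt return to kernel: only search when the TIF bit is set. */
        static bool may_resched_at(struct pt_regs *regs)
        {
                return test_thread_flag(TIF_ALLOW_RESCHED) &&
                       !preempt_count() &&
                       ip_in_resched_range(instruction_pointer(regs));
        }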

Can we take a step back and look at the problem from a scheduling
perspective?

The basic operation of a non-preemptible kernel is time slice
scheduling, which means that a task can run more or less undisturbed for
a full time slice once it gets on the CPU unless it schedules away
voluntarily via a blocking operation.

This works pretty well as long as everything runs in userspace as the
preemption points in the return to user space path are independent of
the preemption model.

These preemption points handle both time slice exhaustion and priority
based preemption.

With PREEMPT=NONE these are the only available preemption points.

That means that kernel code can run more or less indefinitely until it
schedules out or returns to user space, which is obviously not possible
for kernel threads.

To prevent starvation the kernel gained voluntary preemption points,
i.e. cond_resched(), which has to be added manually to code as a
developer sees fit.

Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as
additional preemption points. might_resched() utilizes the existing
might_sleep() debug points, which are in code paths which might block on
a contended resource. These debug points are mostly in core and
infrastructure code and are in code paths which can block anyway. The
only difference is that they allow preemption even when the resource is
uncontended.

Additionally we have PREEMPT=FULL which utilizes every zero transition
of preempt_count as a potential preemption point.

Now we have the situation of long running data copies or data clear
operations which run fully in hardware, but can be interrupted. As the
interrupt return to kernel mode does not preempt in the NONE and
VOLUNTARY cases, new workarounds emerged. Mostly by defining a data
chunk size and adding cond_resched() again.

That's ugly and does not work for long lasting hardware operations so we
ended up with the suggestion of TIF_ALLOW_RESCHED to work around
that. But again this needs to be manually annotated in the same way as an
IP range based preemption scheme requires annotation.

TBH. I detest all of this.

Both cond_resched() and might_sleep/sched() are completely random
mechanisms as seen from time slice operation and the data chunk based
mechanism is just a heuristic which works as well as heuristics tend to
work. allow_resched() is not any different and IP based preemption
mechanisms are not going to be any better.

The approach here is: prevent the scheduler from making decisions and then
mitigate the fallout with heuristics.

That's just backwards as it moves resource control out of the scheduler
into random code which has absolutely no business to do resource
control.

We have the reverse issue observed in PREEMPT_RT. The fact that spinlock
held sections became preemptible caused even more preemption activity
than on a PREEMPT=FULL kernel. The worst side effect of that was
extensive lock contention.

The way how we addressed that was to add a lazy preemption mode, which
tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to
preempt tasks which all belong to the SCHED_OTHER scheduling class. This
works pretty well and gains back a massive amount of performance for the
non-realtime throughput oriented tasks without affecting the
schedulability of real-time tasks at all. IOW, it does not take control
away from the scheduler. It cooperates with the scheduler and leaves the
ultimate decisions to it.

I think we can do something similar for the problem at hand, which
avoids most of these heuristic horrors and control boundary violations.

The main issue is that long running operations do not honour the time
slice and we work around that with cond_resched() and now have ideas
with this new TIF bit and IP ranges.

None of that is really well defined with respect to time slices. In fact
it's not defined at all versus any aspect of scheduling behaviour.

What about the following:

   1) Keep preemption count and the real preemption points enabled
      unconditionally. That's not more overhead than the current
      DYNAMIC_PREEMPT mechanism as long as the preemption count does not
      go to zero, i.e. the folded NEED_RESCHED bit stays set.

      From earlier experiments I know that the overhead of preempt_count
      is minimal and only really observable with micro benchmarks.
      Otherwise it ends up in the noise as long as the slow path is not
      taken.

      I did a quick check comparing a plain inc/dec pair vs. the
DYNAMIC_PREEMPT inc/dec_and_test+NOOP mechanism and the delta is
      in the non-conclusive noise.

      20 years ago this was a real issue because we did not have:

       - the folding of NEED_RESCHED into the preempt count
       
       - the cacheline optimizations which make the preempt count cache
         pretty much always cache hot

       - the hardware was way less capable

      I'm not saying that preempt_count is completely free today as it
      obviously adds more text and affects branch predictors, but as the
      major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
      acceptable and tolerable tradeoff.

   2) When the scheduler wants to set NEED_RESCHED it sets
      NEED_RESCHED_LAZY instead, which is only evaluated in the return to
      user space preemption points.

      As NEED_RESCHED_LAZY is not folded into the preemption count the
      preemption count won't become zero, so the task can continue until
      it hits return to user space.

      That preserves the existing behaviour.

   3) When the scheduler tick observes that the time slice is exhausted,
      then it folds the NEED_RESCHED bit into the preempt count which
      causes the real preemption points to actually preempt including
      the return from interrupt to kernel path.

      That even allows the scheduler to enforce preemption for e.g. RT
      class tasks without changing anything else.

      I'm pretty sure that this gets rid of cond_resched(), which is an
      impressive list of instances:

	./drivers	 392
	./fs		 318
	./mm		 189
	./kernel	 184
	./arch		  95
	./net		  83
	./include	  46
	./lib		  36
	./crypto	  16
	./sound		  16
	./block		  11
	./io_uring	  13
	./security	  11
	./ipc		   3

      That list clearly documents that the majority of these
      cond_resched() invocations is in code which neither should care
      nor should have any influence on the core scheduling decision
      machinery.

I think it's worth a try as it just fits into the existing preemption
scheme, solves the issue of long running kernel functions, prevents
invalid preemption and can utilize the existing instrumentation and
debug infrastructure.

Most importantly it gives control back to the scheduler and does not
make it depend on the mercy of cond_resched(), allow_resched() or
whatever heuristics sprinkled all over the kernel.
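
Purely as an illustration of points 2) and 3), the wakeup and tick sides
could look roughly like this (hypothetical helper and flag names, not an
actual patch):

        /*
         * 2) Ordinary wakeup on a NONE/VOLUNTARY-like model: only request a
         *    reschedule at the return-to-user preemption point.
         *    TIF_NEED_RESCHED_LAZY is the proposed, not yet existing, flag.
         */
        static void resched_curr_lazy(struct rq *rq)
        {
                set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY);
                /* Not folded into preempt_count, so kernel code keeps running. */
        }

        /*
         * 3) Scheduler tick on the local runqueue: the time slice is gone,
         *    escalate to a real preemption request.
         */
        static void sched_tick_enforce_slice(struct rq *rq)
        {
                if (!slice_exhausted(rq->curr))         /* hypothetical check */
                        return;

                set_tsk_need_resched(rq->curr);
                set_preempt_need_resched();
                /*
                 * From now on every preempt_enable() which drops the count to
                 * zero, and the return from interrupt to kernel, will preempt.
                 */
        }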

To me this makes a lot of sense, but I might be on the completely wrong
track. So feel free to tell me that I'm completely nuts and/or just not
seeing the obvious.

Thanks,

        tglx
Linus Torvalds Sept. 19, 2023, 1:57 a.m. UTC | #34
On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> What about the following:
>
>    1) Keep preemption count and the real preemption points enabled
>       unconditionally.

Well, it's certainly the simplest solution, and gets rid of not just
the 'rep string' issue, but gets rid of all the cond_resched() hackery
entirely.

>       20 years ago this was a real issue because we did not have:
>
>        - the folding of NEED_RESCHED into the preempt count
>
>        - the cacheline optimizations which make the preempt count cache
>          pretty much always cache hot
>
>        - the hardware was way less capable
>
>       I'm not saying that preempt_count is completely free today as it
>       obviously adds more text and affects branch predictors, but as the
>       major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
>       acceptable and tolerable tradeoff.

Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in
most distros does speak for just admitting that the PREEMPT_NONE /
VOLUNTARY approach isn't actually used, and is only causing pain.

>    2) When the scheduler wants to set NEED_RESCHED due it sets
>       NEED_RESCHED_LAZY instead which is only evaluated in the return to
>       user space preemption points.

Is this just to try to emulate the existing PREEMPT_NONE behavior?

If the new world order is that the time slice is always honored, then
the "this might be a latency issue" goes away. Good.

And we'd also get better coverage for the *debug* aim of
"might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on
PREEMPT_COUNT always existing.

But because the latency argument is gone, the "might_resched()" should
then just be removed entirely from "might_sleep()", so that
might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing.

That argues for your suggestion too, since we had a performance issue
due to "might_sleep()" _not_ being just a debug thing, and pointlessly
causing a reschedule in a place where reschedules were _allowed_, but
certainly much less than optimal.

Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
'might_sleep()' in get_mmap_lock_carefully()").

However, that does bring up an issue: even with full preemption, there
are certainly places where we are *allowed* to schedule (when the
preempt count is zero), but there are also some places that are
*better* than other places to schedule (for example, when we don't
hold any other locks).

So, I do think that if we just decide to go "let's just always be
preemptible", we might still have points in the kernel where
preemption might be *better* than in others points.

But none of might_resched(), might_sleep() _or_ cond_resched() are
necessarily that kind of "this is a good point" thing. They come from
a different background.

So what I think what you are saying is that we'd have the following situation:

 - scheduling at "return to user space" is presumably always a good thing.

A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
whatever) would cover that, and would give us basically the existing
CONFIG_PREEMPT_NONE behavior.

So a config variable (either compile-time with PREEMPT_NONE or a
dynamic one with DYNAMIC_PREEMPT set to none) would make any external
wakeup only set that bit.

And then a "fully preemptible low-latency desktop" would set the
preempt-count bit too.

 - but the "timeslice over" case would always set the
preempt-count-bit, regardless of any config, and would guarantee that
we have reasonable latencies.

This all makes cond_resched() (and might_resched()) pointless, and
they can just go away.

Then the question becomes whether we'd want to introduce a *new*
concept, which is a "if you are going to schedule, do it now rather
than later, because I'm taking a lock, and while it's a preemptible
lock, I'd rather not sleep while holding this resource".

I suspect we want to avoid that for now, on the assumption that it's
hopefully not a problem in practice (the recently addressed problem
with might_sleep() was that it actively *moved* the scheduling point
to a bad place, not that scheduling could happen there, so instead of
optimizing scheduling, it actively pessimized it). But I thought I'd
mention it.

Anyway, I'm definitely not opposed. We'd get rid of a config option
that is presumably not very widely used, and we'd simplify a lot of
issues, and get rid of all these badly defined "cond_preempt()"
things.

                Linus
Andy Lutomirski Sept. 19, 2023, 3:21 a.m. UTC | #35
On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote:
> On preempt_model_none() or preempt_model_voluntary() configurations
> rescheduling of kernel threads happens only when they allow it, and
> only at explicit preemption points, via calls to cond_resched() or
> similar.
>
> That leaves out contexts where it is not convenient to periodically
> call cond_resched() -- for instance when executing a potentially long
> running primitive (such as REP; STOSB.)
>

So I said this not too long ago in the context of Xen PV, but maybe it's time to ask it in general:

Why do we support anything other than full preempt?  I can think of two reasons, neither of which I think is very good:

1. Once upon a time, tracking preempt state was expensive.  But we fixed that.

2. Folklore suggests that there's a latency vs throughput tradeoff, and serious workloads, for some definition of serious, want throughput, so they should run without full preemption.

I think #2 is a bit silly.  If you want throughput, and you're waiting for a CPU that should be running you but isn't, because it's busy running some low-priority non-preemptible thing (because preempt is set to none or voluntary), you're not getting throughput.  If you want to keep some I/O resource busy to get throughput, but you have excessive latency getting scheduled, you don't get throughput either.

If the actual problem is that there's a workload that performs better when scheduling is delayed (which preempt=none and preempt=voluntary do, essentially at random), then maybe someone should identify that workload and fix the scheduler.

So maybe we should just very strongly encourage everyone to run with full preempt and simplify the kernel?
Ingo Molnar Sept. 19, 2023, 7:21 a.m. UTC | #36
* Thomas Gleixner <tglx@linutronix.de> wrote:

> Additionally we have PREEMPT=FULL which utilizes every zero transition
> of preeempt_count as a potential preemption point.

Just to complete this nice new entry to Documentation/sched/: in 
PREEMPT=FULL there's also IRQ-return driven preemption of kernel-mode code, 
at almost any instruction boundary the hardware allows, in addition to the 
preemption driven by regular zero transition of preempt_count in 
syscall/kthread code.

Thanks,

	Ingo
Ingo Molnar Sept. 19, 2023, 8:03 a.m. UTC | #37
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > What about the following:
> >
> >    1) Keep preemption count and the real preemption points enabled
> >       unconditionally.
> 
> Well, it's certainly the simplest solution, and gets rid of not just
> the 'rep string' issue, but gets rid of all the cond_resched() hackery
> entirely.
> 
> >       20 years ago this was a real issue because we did not have:
> >
> >        - the folding of NEED_RESCHED into the preempt count
> >
> >        - the cacheline optimizations which make the preempt count cache
> >          pretty much always cache hot
> >
> >        - the hardware was way less capable
> >
> >       I'm not saying that preempt_count is completely free today as it
> >       obviously adds more text and affects branch predictors, but as the
> >       major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
> >       acceptable and tolerable tradeoff.
> 
> Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most 
> distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY 
> approach isn't actually used, and is only causing pain.

The macro-behavior of NONE/VOLUNTARY is still used & relied upon in server 
distros - and that's the behavior that enterprise distros truly cared 
about.

Micro-overhead of NONE/VOLUNTARY vs. FULL is nonzero but is in the 'noise' 
category for all major distros I'd say.

And that's what Thomas's proposal achieves: keep the nicely execution-batched 
NONE/VOLUNTARY scheduling behavior for SCHED_OTHER tasks, while having the 
latency advantages of fully-preemptible kernel code for RT and critical 
tasks.

So I'm fully on board with this. It would reduce the number of preemption 
variants to just two: regular kernel and PREEMPT_RT. Yummie!

> >    2) When the scheduler wants to set NEED_RESCHED due it sets
> >       NEED_RESCHED_LAZY instead which is only evaluated in the return to
> >       user space preemption points.
> 
> Is this just to try to emulate the existing PREEMPT_NONE behavior?

Yes: I'd guesstimate that the batching caused by timeslice-laziness that is 
naturally part of NONE/VOLUNTARY resolves ~90%+ of observable 
macro-performance regressions between NONE/VOLUNTARY and PREEMPT/RT.

> If the new world order is that the time slice is always honored, then the 
> "this might be a latency issue" goes away. Good.
> 
> And we'd also get better coverage for the *debug* aim of "might_sleep()" 
> and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on PREEMPT_COUNT always 
> existing.
> 
> But because the latency argument is gone, the "might_resched()" should 
> then just be removed entirely from "might_sleep()", so that might_sleep() 
> would *only* be that DEBUG_ATOMIC_SLEEP thing.

Correct. And that's even a minor code generation advantage, as we wouldn't 
have these additional hundreds of random/statistical preemption checks.

> That argues for your suggestion too, since we had a performance issue due 
> to "might_sleep()" _not_ being just a debug thing, and pointlessly 
> causing a reschedule in a place where reschedules were _allowed_, but 
> certainly much less than optimal.
> 
> Which then caused that fairly recent commit 4542057e18ca ("mm: avoid 
> 'might_sleep()' in get_mmap_lock_carefully()").

4542057e18ca is arguably kind of a workaround though - and with the 
preempt_count + NEED_RESCHED_LAZY approach we'd have both the latency 
advantages *and* the execution-batching performance advantages of 
NONE/VOLUNTARY that 4542057e18ca exposed.

> However, that does bring up an issue: even with full preemption, there 
> are certainly places where we are *allowed* to schedule (when the preempt 
> count is zero), but there are also some places that are *better* than 
> other places to schedule (for example, when we don't hold any other 
> locks).
> 
> So, I do think that if we just decide to go "let's just always be 
> preemptible", we might still have points in the kernel where preemption 
> might be *better* than in others points.

So in the broadest sense we have 3 stages of pending preemption:

   NEED_RESCHED_LAZY
   NEED_RESCHED_SOON
   NEED_RESCHED_NOW

And we'd transition:

  - from    0 -> SOON when an eligible task is woken up,
  - from LAZY -> SOON when current timeslice is exhausted,
  - from SOON -> NOW  when no locks/resources are held.

  [ With a fast-track for RT or other urgent tasks to enter NOW immediately. ]

On the regular kernels it's probably not worth modeling the SOON/NOW split, 
as we'd have to track the depth of sleeping locks as well, which we don't 
do right now.

On PREEMPT_RT the SOON/NOW distinction possibly makes sense, as there we 
are aware of locking depth already and it would be relatively cheap to 
check for it on natural 0-preempt_count boundaries.
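
As a toy model (illustrative C only, not kernel code), those transitions
could be written as:

        enum resched_stage { RESCHED_OFF, RESCHED_LAZY, RESCHED_SOON, RESCHED_NOW };

        static enum resched_stage on_wakeup(enum resched_stage s, bool urgent)
        {
                if (urgent)                     /* RT or other urgent task */
                        return RESCHED_NOW;
                return s < RESCHED_SOON ? RESCHED_SOON : s;
        }

        static enum resched_stage on_tick(enum resched_stage s, bool slice_gone)
        {
                return (s == RESCHED_LAZY && slice_gone) ? RESCHED_SOON : s;
        }

        static enum resched_stage on_resource_release(enum resched_stage s, int held)
        {
                return (s == RESCHED_SOON && !held) ? RESCHED_NOW : s;
        }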


> But none of might_resched(), might_sleep() _or_ cond_resched() are 
> necessarily that kind of "this is a good point" thing. They come from a 
> different background.

Correct, they come from two sources:

 - There are hundreds of points that we know are 'technically correct' 
   preemption points, and they break up ~90% of long latencies by brute 
   force & chance.

 - Explicitly identified problem points that added a cond_resched() or its 
   equivalent. These are rare and also tend to bitrot, because *removing* 
   them is always more risky than adding them, so they tend to accumulate.

> So what I think what you are saying is that we'd have the following 
> situation:
> 
>  - scheduling at "return to user space" is presumably always a good thing.
> 
> A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
> whatever) would cover that, and would give us basically the existing
> CONFIG_PREEMPT_NONE behavior.
> 
> So a config variable (either compile-time with PREEMPT_NONE or a
> dynamic one with DYNAMIC_PREEMPT set to none) would make any external
> wakeup only set that bit.
> 
> And then a "fully preemptible low-latency desktop" would set the
> preempt-count bit too.

I'd even argue that we only need two preemption modes, and that 'fully 
preemptible low-latency desktop' is an artifact of poor latencies on 
PREEMPT_NONE.

Ie. in the long run - after a careful period of observing performance 
regressions and other dragons - we'd only have *two* preemption modes left:

   !PREEMPT_RT     # regular kernel. Single default behavior.
   PREEMPT_RT=y    # -rt kernel, because rockets, satellites & cars matter.

Any other application level preemption preferences can be expressed via 
scheduling policies & priorities.

Nothing else. We don't need PREEMPT_DYNAMIC, PREEMPT_VOLUNTARY or 
PREEMPT_NONE in any of their variants, probably not even as runtime knobs.

People who want shorter timeslices can set shorter timeslices, and people 
who want immediate preemption of certain tasks can manage priorities.

>  - but the "timeslice over" case would always set the preempt-count-bit, 
> regardless of any config, and would guarantee that we have reasonable 
> latencies.

Yes. Probably a higher nice-priority task becoming runnable would cause 
immediate preemption too, in addition to RT tasks.

Ie. the execution batching would be for same-priority groups of SCHED_OTHER 
tasks.

> This all makes cond_resched() (and might_resched()) pointless, and
> they can just go away.

Yep.

> Then the question becomes whether we'd want to introduce a *new* concept, 
> which is a "if you are going to schedule, do it now rather than later, 
> because I'm taking a lock, and while it's a preemptible lock, I'd rather 
> not sleep while holding this resource".

Something close to this concept is naturally available on PREEMPT_RT 
kernels, which only use a single central lock primitive (rt_mutex), but it 
would have to be added explicitly for regular kernels.

We could do the following intermediate step:

 - Remove all the random cond_resched() points such as might_sleep()
 - Turn all explicit cond_resched() points into 'ideal point to reschedule'.

 - Maybe even rename it from cond_resched() to resched_point(), to signal 
   the somewhat different role.

While cond_resched() and resched_point() are not 100% matches, they are 
close enough, as most existing cond_resched() points were added to places 
that cause the least amount of disruption with held resources.

But I think it would be better to add resched_point() as a new API, and add 
it to places where there's a performance benefit. Clean slate, 
documentation, and all that.

> I suspect we want to avoid that for now, on the assumption that it's 
> hopefully not a problem in practice (the recently addressed problem with 
> might_sleep() was that it actively *moved* the scheduling point to a bad 
> place, not that scheduling could happen there, so instead of optimizing 
> scheduling, it actively pessimized it). But I thought I'd mention it.
> 
> Anyway, I'm definitely not opposed. We'd get rid of a config option that 
> is presumably not very widely used, and we'd simplify a lot of issues, 
> and get rid of all these badly defined "cond_preempt()" things.

I think we can get rid of *all* the preemption model Kconfig knobs, except 
PREEMPT_RT. :-)

Thanks,

	Ingo
Ingo Molnar Sept. 19, 2023, 8:43 a.m. UTC | #38
* Ingo Molnar <mingo@kernel.org> wrote:

> > Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most 
> > distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY 
> > approach isn't actually used, and is only causing pain.
> 
> The macro-behavior of NONE/VOLUNTARY is still used & relied upon in 
> server distros - and that's the behavior that enterprise distros truly 
> cared about.
> 
> Micro-overhead of NONE/VOLUNTARY vs. FULL is nonzero but is in the 
> 'noise' category for all major distros I'd say.
> 
> And that's what Thomas's proposal achieves: keep the nicely 
> execution-batched NONE/VOLUNTARY scheduling behavior for SCHED_OTHER 
> tasks, while having the latency advantages of fully-preemptible kernel 
> code for RT and critical tasks.
> 
> So I'm fully on board with this. It would reduce the number of preemption 
> variants to just two: regular kernel and PREEMPT_RT. Yummie!

As an additional side note: with various changes such as EEVDF the 
scheduler is a lot less preemption-happy these days, without wrecking 
latencies & timeslice distribution.

So in principle we might not even need the NEED_RESCHED_LAZY extra bit, 
which -rt uses as a kind of additional layer to make sure they don't change 
scheduling policy.

Ie. a modern scheduler might have mooted much of this change:

   4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()")

... because now we'll only reschedule on timeslice exhaustion, or if a task 
comes in with a big deadline deficit.

And even the deadline-deficit wakeup preemption can be turned off further 
with:

    $ echo NO_WAKEUP_PREEMPTION > /debug/sched/features

And we are considering making that the default behavior for same-prio tasks 
- basically turn same-prio SCHED_OTHER tasks into SCHED_BATCH - which 
should be quite similar to what NEED_RESCHED_LAZY achieves on -rt.

Thanks,

	Ingo
Thomas Gleixner Sept. 19, 2023, 9:20 a.m. UTC | #39
On Mon, Sep 18 2023 at 20:21, Andy Lutomirski wrote:
> On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote:

> Why do we support anything other than full preempt?  I can think of
> two reasons, neither of which I think is very good:
>
> 1. Once upon a time, tracking preempt state was expensive.  But we fixed that.
>
> 2. Folklore suggests that there's a latency vs throughput tradeoff,
>    and serious workloads, for some definition of serious, want
>    throughput, so they should run without full preemption.

It's absolutely not folklore. Run to completion has well known
benefits as it avoids contention and the overhead of scheduling
in a large number of scenarios.

We've seen that painfully in PREEMPT_RT before we came up with the
concept of lazy preemption for throughput oriented tasks.

Thanks,

        tglx
Ingo Molnar Sept. 19, 2023, 9:49 a.m. UTC | #40
* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Mon, Sep 18 2023 at 20:21, Andy Lutomirski wrote:
> > On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote:
> 
> > Why do we support anything other than full preempt?  I can think of
> > two reasons, neither of which I think is very good:
> >
> > 1. Once upon a time, tracking preempt state was expensive.  But we fixed that.
> >
> > 2. Folklore suggests that there's a latency vs throughput tradeoff,
> >    and serious workloads, for some definition of serious, want
> >    throughput, so they should run without full preemption.
> 
> It's absolutely not folklore. Run to completion is has well known 
> benefits as it avoids contention and avoids the overhead of scheduling 
> for a large amount of scenarios.
> 
> We've seen that painfully in PREEMPT_RT before we came up with the 
> concept of lazy preemption for throughput oriented tasks.

Yeah, for a large majority of workloads reduction in preemption increases 
batching and improves cache locality. Most scalability-conscious enterprise 
users want longer timeslices & better cache locality, not shorter 
timeslices with spread out cache use.

There's microbenchmarks that fit mostly in cache that benefit if work is 
immediately processed by freshly woken tasks - but that's not true for most 
workloads with a substantial real-life cache footprint.

Thanks,

	Ingo
Thomas Gleixner Sept. 19, 2023, 12:30 p.m. UTC | #41
Linus!

On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
> On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@linutronix.de> wrote:
>>    2) When the scheduler wants to set NEED_RESCHED due it sets
>>       NEED_RESCHED_LAZY instead which is only evaluated in the return to
>>       user space preemption points.
>
> Is this just to try to emulate the existing PREEMPT_NONE behavior?

To some extent yes.

> If the new world order is that the time slice is always honored, then
> the "this might be a latency issue" goes away. Good.

That's the point.

> And we'd also get better coverage for the *debug* aim of
> "might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on
> PREEMPT_COUNT always existing.
>
> But because the latency argument is gone, the "might_resched()" should
> then just be removed entirely from "might_sleep()", so that
> might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing.

True. And this gives the scheduler the flexibility to enforce preemption
under certain conditions, e.g. when a task with RT scheduling class or a
task with a sporadic event handler is woken up. That's what VOLUNTARY
tries to achieve with all the might_sleep()/might_resched() magic.

> That argues for your suggestion too, since we had a performance issue
> due to "might_sleep()" _not_ being just a debug thing, and pointlessly
> causing a reschedule in a place where reschedules were _allowed_, but
> certainly much less than optimal.
>
> Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
> 'might_sleep()' in get_mmap_lock_carefully()").

Awesome.

> However, that does bring up an issue: even with full preemption, there
> are certainly places where we are *allowed* to schedule (when the
> preempt count is zero), but there are also some places that are
> *better* than other places to schedule (for example, when we don't
> hold any other locks).
>
> So, I do think that if we just decide to go "let's just always be
> preemptible", we might still have points in the kernel where
> preemption might be *better* than in others points.
>
> But none of might_resched(), might_sleep() _or_ cond_resched() are
> necessarily that kind of "this is a good point" thing. They come from
> a different background.

They are subject to subsystem/driver specific preferences and therefore
biased towards a certain usage scenario, which is not necessarily to the
benefit of everyone else.

> So what I think what you are saying is that we'd have the following situation:
>
>  - scheduling at "return to user space" is presumably always a good thing.
>
> A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
> whatever) would cover that, and would give us basically the existing
> CONFIG_PREEMPT_NONE behavior.
>
> So a config variable (either compile-time with PREEMPT_NONE or a
> dynamic one with DYNAMIC_PREEMPT set to none) would make any external
> wakeup only set that bit.
>
> And then a "fully preemptible low-latency desktop" would set the
> preempt-count bit too.

Correct.

>  - but the "timeslice over" case would always set the
> preempt-count-bit, regardless of any config, and would guarantee that
> we have reasonable latencies.

Yes. That's the reasoning.

> This all makes cond_resched() (and might_resched()) pointless, and
> they can just go away.

:)

So the decision matrix would be:

                Ret2user        Ret2kernel      PreemptCnt=0

NEED_RESCHED       Y                Y               Y 
LAZY_RESCHED       Y                N               N

That is completely independent of the preemption model and the
differentiation of the preemption models happens solely at the scheduler
level:

PREEMPT_NONE sets only LAZY_RESCHED unless it needs to enforce the time
slice where it sets NEED_RESCHED.

PREEMPT_VOLUNTARY extends the NONE model so that the wakeup of RT class
tasks or sporadic event tasks sets NEED_RESCHED too.

PREEMPT_FULL always sets NEED_RESCHED like today.
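
Expressed as code, the wakeup side of that matrix could look something like
this (resched_curr() and the preempt_model_*() helpers exist today;
resched_curr_lazy() and task_is_urgent() are made up for illustration):

        static void resched_curr_by_model(struct rq *rq, struct task_struct *woken)
        {
                /* FULL always preempts; VOLUNTARY only for urgent wakeups. */
                if (preempt_model_full() ||
                    (preempt_model_voluntary() && task_is_urgent(woken))) {
                        resched_curr(rq);       /* NEED_RESCHED, folded into preempt_count */
                        return;
                }

                /* NONE (and non-urgent VOLUNTARY): defer to return to user space. */
                resched_curr_lazy(rq);          /* sets only LAZY_RESCHED */
        }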

We should be able to merge the PREEMPT_NONE/VOLUNTARY behaviour so that we
only end up with two variants or even subsume PREEMPT_FULL into that
model because that's what is closer to the RT LAZY preempt behaviour,
which has two goals:

      1) Make low latency guarantees for RT workloads

      2) Preserve the throughput for non-RT workloads

But in any case this decision happens solely in the core scheduler code
and nothing outside of it needs to be changed.

So we not only get rid of the cond/might_resched() muck, we also get rid
of the static_call/static_key machinery which drives PREEMPT_DYNAMIC.
The only place which still needs that runtime tweaking is the scheduler
itself.

Though it just occurred to me that there are dragons lurking:

arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
arch/um/Kconfig:        select ARCH_NO_PREEMPT

So we have four architectures which refuse to enable preemption points,
i.e. the only model they allow is NONE and they rely on cond_resched()
for breaking large computations.

But they support PREEMPT_COUNT, so we might get away with a reduced
preemption point coverage:

                Ret2user        Ret2kernel      PreemptCnt=0

NEED_RESCHED       Y                N               Y 
LAZY_RESCHED       Y                N               N

i.e. the only difference is that Ret2kernel is not a preemption
point. That's where the scheduler tick enforcement of the time slice
happens.

It still might work out good enough and if not then it should not be
real rocket science to add that Ret2kernel preemption point to cure it.

> Then the question becomes whether we'd want to introduce a *new*
> concept, which is a "if you are going to schedule, do it now rather
> than later, because I'm taking a lock, and while it's a preemptible
> lock, I'd rather not sleep while holding this resource".
>
> I suspect we want to avoid that for now, on the assumption that it's
> hopefully not a problem in practice (the recently addressed problem
> with might_sleep() was that it actively *moved* the scheduling point
> to a bad place, not that scheduling could happen there, so instead of
> optimizing scheduling, it actively pessimized it). But I thought I'd
> mention it.

I think we want to avoid that completely and if this becomes an issue,
we'd rather be smart about it at the core level.

It's trivial enough to have a per task counter which tells whether a
preemptible lock is held (or about to be acquired) or not. Then the
scheduler can take that hint into account and decide to grant a
timeslice extension once in the expectation that the task leaves the
lock held section soonish and either returns to user space or schedules
out. It still can enforce it later on.
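
A rough sketch of that hint, with hypothetical task_struct fields and helper
names:

        /*
         * Hypothetical task_struct members:
         *      int     sleepable_locks_held;
         *      int     slice_extended;
         */
        static inline void sleepable_lock_acquired(void)
        {
                current->sleepable_locks_held++;
        }

        static inline void sleepable_lock_released(void)
        {
                current->sleepable_locks_held--;
        }

        /* Scheduler tick, time slice just expired: grant at most one extension. */
        static bool grant_slice_extension(struct task_struct *p)
        {
                if (p->sleepable_locks_held && !p->slice_extended) {
                        p->slice_extended = 1;  /* enforce preemption next time */
                        return true;
                }
                return false;
        }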

We really want to let the scheduler decide and rather give it proper
hints at the conceptual level instead of letting developers make random
decisions which might work well for a particular use case and completely
suck for the rest. I think we wasted enough time already on those.

> Anyway, I'm definitely not opposed. We'd get rid of a config option
> that is presumably not very widely used, and we'd simplify a lot of
> issues, and get rid of all these badly defined "cond_preempt()"
> things.

Hmm. Didn't I promise a year ago that I wouldn't do further large scale
cleanups and simplifications beyond printk?

Maybe I get away this time with just suggesting it. :)

Thanks,

        tglx
Matthew Wilcox Sept. 19, 2023, 1 p.m. UTC | #42
On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote:
> Though it just occured to me that there are dragons lurking:
> 
> arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
> arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
> arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
> arch/um/Kconfig:        select ARCH_NO_PREEMPT

Sounds like three-and-a-half architectures which could be queued up for
removal right behind ia64 ...

I suspect none of these architecture maintainers have any idea there's a
problem.  Look at commit 87a4c375995e and the discussion in
https://lore.kernel.org/lkml/20180724175646.3621-1-hch@lst.de/

Let's cc those maintainers so they can remove this and fix whatever
breaks.
Thomas Gleixner Sept. 19, 2023, 1:25 p.m. UTC | #43
Ingo!

On Tue, Sep 19 2023 at 10:03, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> Then the question becomes whether we'd want to introduce a *new* concept, 
>> which is a "if you are going to schedule, do it now rather than later, 
>> because I'm taking a lock, and while it's a preemptible lock, I'd rather 
>> not sleep while holding this resource".
>
> Something close to this concept is naturally available on PREEMPT_RT 
> kernels, which only use a single central lock primitive (rt_mutex), but it 
> would have be added explicitly for regular kernels.
>
> We could do the following intermediate step:
>
>  - Remove all the random cond_resched() points such as might_sleep()
>  - Turn all explicit cond_resched() points into 'ideal point to reschedule'.
>
>  - Maybe even rename it from cond_resched() to resched_point(), to signal 
>    the somewhat different role.
>
> While cond_resched() and resched_point() are not 100% matches, they are 
> close enough, as most existing cond_resched() points were added to places 
> that cause the least amount of disruption with held resources.
>
> But I think it would be better to add resched_point() as a new API, and add 
> it to places where there's a performance benefit. Clean slate, 
> documentation, and all that.

Let's not go there. You just replace one magic mushroom with a different
flavour. We want to get rid of them completely.

The whole point is to let the scheduler decide and give it enough
information to make informed decisions.

So with the LAZY scheme in effect, there is no real reason to have these
extra points and I'd rather add task::sleepable_locks_held and do that
accounting in the relevant lock/unlock paths. Based on that the
scheduler can decide whether it grants a time slice expansion or just
says no.

That's extremely cheap and well defined.

You can document the hell out of resched_point(), but it won't be any
different from the existing ones, will always be subject to personal
preference and goals, and it's going to be sprinkled all over the place
just like the existing ones. So where is the gain?

Thanks,

        tglx
Geert Uytterhoeven Sept. 19, 2023, 1:34 p.m. UTC | #44
Hi Willy,

On Tue, Sep 19, 2023 at 3:01 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote:
> > Though it just occured to me that there are dragons lurking:
> >
> > arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
> > arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
> > arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
> > arch/um/Kconfig:        select ARCH_NO_PREEMPT
>
> Sounds like three-and-a-half architectures which could be queued up for
> removal right behind ia64 ...
>
> I suspect none of these architecture maintainers have any idea there's a
> problem.  Look at commit 87a4c375995e and the discussion in
> https://lore.kernel.org/lkml/20180724175646.3621-1-hch@lst.de/

These links don't really point out there is a grave problem?

> Let's cc those maintainers so they can remove this and fix whatever
> breaks.

Gr{oetje,eeting}s,

                        Geert
John Paul Adrian Glaubitz Sept. 19, 2023, 1:37 p.m. UTC | #45
On Tue, 2023-09-19 at 14:00 +0100, Matthew Wilcox wrote:
> On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote:
> > Though it just occured to me that there are dragons lurking:
> > 
> > arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
> > arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
> > arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
> > arch/um/Kconfig:        select ARCH_NO_PREEMPT
> 
> Sounds like three-and-a-half architectures which could be queued up for
> removal right behind ia64 ...

The agreement to kill off ia64 wasn't an invitation to kill off other stuff
that people are still working on! Can we please not do this?

Thanks,
Adrian
Peter Zijlstra Sept. 19, 2023, 1:42 p.m. UTC | #46
On Tue, Sep 19, 2023 at 03:37:24PM +0200, John Paul Adrian Glaubitz wrote:
> On Tue, 2023-09-19 at 14:00 +0100, Matthew Wilcox wrote:
> > On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote:
> > > Though it just occured to me that there are dragons lurking:
> > > 
> > > arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
> > > arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
> > > arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
> > > arch/um/Kconfig:        select ARCH_NO_PREEMPT
> > 
> > Sounds like three-and-a-half architectures which could be queued up for
> > removal right behind ia64 ...
> 
> The agreement to kill off ia64 wasn't an invitation to kill off other stuff
> that people are still working on! Can we please not do this?

If you're working on one of them, then surely it's a simple matter of
working on adding CONFIG_PREEMPT support :-)
Thomas Gleixner Sept. 19, 2023, 1:43 p.m. UTC | #47
On Tue, Sep 19 2023 at 10:43, Ingo Molnar wrote:
> * Ingo Molnar <mingo@kernel.org> wrote:
> Ie. a modern scheduler might have mooted much of this change:
>
>    4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()")
>
> ... because now we'll only reschedule on timeslice exhaustion, or if a task 
> comes in with a big deadline deficit.
>
> And even the deadline-deficit wakeup preemption can be turned off further 
> with:
>
>     $ echo NO_WAKEUP_PREEMPTION > /debug/sched/features
>
> And we are considering making that the default behavior for same-prio tasks 
> - basically turn same-prio SCHED_OTHER tasks into SCHED_BATCH - which 
> should be quite similar to what NEED_RESCHED_LAZY achieves on -rt.

I don't think that you can get rid of NEED_RESCHED_LAZY for !RT because
there is a clear advantage of having the return to user preemption
point.

It spares having the kernel/user transition just to get the task back
via the timeslice interrupt. I experimented with that on RT and the
result was definitely worse.

We surely can revisit that, but I'd really start with the
straightforward mappable LAZY bit approach and if experimentation turns out to
provide good enough results by not setting that bit at all, then we
still can do so without changing anything except the core scheduler
decision logic.

It's again a cheap thing due to the way how the return to user TIF
handling works:

	ti_work = read_thread_flags();
	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
		ti_work = exit_to_user_mode_loop(regs, ti_work);

TIF_LAZY_RESCHED is part of EXIT_TO_USER_MODE_WORK, so the non-work case
does not become more expensive than today. If any of the bits is set,
then the slowpath won't get measurably different performance whether the bit
is evaluated or not in exit_to_user_mode_loop().

As we really want TIF_LAZY_RESCHED for RT, we just keep all of this
consistent in terms of code and make it purely a scheduler decision whether it
utilizes it or not. As a consequence PREEMPT_RT is no longer special in
that regard and the main RT difference becomes the lock substitution and
forced interrupt threading.

For the magic 'spare me the extra conditional' optimization of
exit_to_user_mode_loop() if LAZY can be optimized out for !RT because
the scheduler is sooo clever (which I doubt), we can just use the same
approach as for other TIF bits and define them to 0 :)
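
That is the usual pattern of defining the mask to constant zero for
configurations which never set the bit, e.g. (hypothetical config and flag
names):

        #ifdef CONFIG_PREEMPT_LAZY
        # define _TIF_LAZY_RESCHED      (1UL << TIF_LAZY_RESCHED)
        #else
        # define _TIF_LAZY_RESCHED      0UL
        #endif

        /*
         * In exit_to_user_mode_loop(): with the mask defined to 0 the compiler
         * drops this test entirely, so configs which never set the bit pay
         * nothing for it.
         */
        if (ti_work & _TIF_LAZY_RESCHED)
                schedule();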

So let's start consistent and optimize on top if really required.

Thanks,

        tglx
John Paul Adrian Glaubitz Sept. 19, 2023, 1:48 p.m. UTC | #48
On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote:
> > The agreement to kill off ia64 wasn't an invitation to kill off other stuff
> > that people are still working on! Can we please not do this?
> 
> If you're working on one of them, then surely it's a simple matter of
> working on adding CONFIG_PREEMPT support :-)

As Geert pointed out, I'm not seeing anything particularly problematic with the
architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
something about organizing KConfig files.

I find it a bit unfair that maintainers of architectures that have huge companies
behind them use their manpower to urge less popular architectures for removal just
because they don't have 150 people working on the port so they can keep up with
design changes quickly.

Adrian
Peter Zijlstra Sept. 19, 2023, 2:16 p.m. UTC | #49
On Tue, Sep 19, 2023 at 03:48:09PM +0200, John Paul Adrian Glaubitz wrote:
> On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote:
> > > The agreement to kill off ia64 wasn't an invitation to kill off other stuff
> > > that people are still working on! Can we please not do this?
> > 
> > If you're working on one of them, then surely it's a simple matter of
> > working on adding CONFIG_PREEMPT support :-)
> 
> As Geert poined out, I'm not seeing anything particular problematic with the
> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
> something about organizing KConfig files.

The plan in the parent thread is to remove PREEMPT_NONE and
PREEMPT_VOLUNTARY and only keep PREEMPT_FULL.

> I find it a bit unfair that maintainers of architectures that have huge companies
> behind them use their manpower to urge less popular architectures for removal just
> because they don't have 150 people working on the port so they can keep up with
> design changes quickly.

PREEMPT isn't something new. Also, I don't think the arch part for
actually supporting it is particularly hard; mostly it is sticking the
preempt_schedule_irq() call into the return-from-interrupt code path.

If you convert the arch to generic-entry (a much larger undertaking)
then you get this for free.
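
In generic-entry terms that preemption point boils down to roughly the
following (simplified sketch; the real helper also does RCU and stack
sanity checks):

        static void irqentry_exit_cond_resched_sketch(void)
        {
                /* Called on return from interrupt to kernel, IRQs disabled. */
                if (!preempt_count() && need_resched())
                        preempt_schedule_irq();
        }

which is essentially what the generic entry code already provides for the
architectures that use it.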
Thomas Gleixner Sept. 19, 2023, 2:17 p.m. UTC | #50
On Tue, Sep 19 2023 at 15:48, John Paul Adrian Glaubitz wrote:
> On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote:
>> > The agreement to kill off ia64 wasn't an invitation to kill off other stuff
>> > that people are still working on! Can we please not do this?
>> 
>> If you're working on one of them, then surely it's a simple matter of
>> working on adding CONFIG_PREEMPT support :-)
>
> As Geert poined out, I'm not seeing anything particular problematic with the
> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
> something about organizing KConfig files.
>
> I find it a bit unfair that maintainers of architectures that have huge companies
> behind them use their manpower to urge less popular architectures for removal just
> because they don't have 150 people working on the port so they can keep up with
> design changes quickly.

I don't urge for removal. I just noticed that these four architectures
lack PREEMPT support. The only thing which is missing is the actual
preemption point in the return to kernel code path.

But otherwise it should just work, which I obviously can't confirm :)

Even without that preemption point it should build and boot. There might
be some minor latency issues when that preemption point is not there,
but adding it is not rocket science either. It's probably about 10 lines
of ASM code, if at all.

Though not adding that might cause a blocking issue for the rework of
the whole preemption logic in order to remove the sprinkled around
cond_resched() muck or force us to maintain some nasty workaround just
for the benefit of a few stragglers.

So I can make the same argument the other way around, that it's
unjustified that some architectures which are just supported for
nostalgia throw roadblocks into kernel development.

If my ALPHA foo wouldn't be very close to zero, I'd write that ASM hack
myself, but that's going to cost more of my and your time than it's
worth the trouble,

Hmm. I could delegate that to Linus, he might still remember :)

Thanks,

        tglx
Anton Ivanov Sept. 19, 2023, 2:21 p.m. UTC | #51
On 19/09/2023 14:42, Peter Zijlstra wrote:
> On Tue, Sep 19, 2023 at 03:37:24PM +0200, John Paul Adrian Glaubitz wrote:
>> On Tue, 2023-09-19 at 14:00 +0100, Matthew Wilcox wrote:
>>> On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote:
>>>> Though it just occured to me that there are dragons lurking:
>>>>
>>>> arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
>>>> arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
>>>> arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
>>>> arch/um/Kconfig:        select ARCH_NO_PREEMPT
>>>
>>> Sounds like three-and-a-half architectures which could be queued up for
>>> removal right behind ia64 ...
>>
>> The agreement to kill off ia64 wasn't an invitation to kill off other stuff
>> that people are still working on! Can we please not do this?
> 
> If you're working on one of them, then surely it's a simple matter of
> working on adding CONFIG_PREEMPT support :-)

In the case of UML adding preempt will be quite difficult. I looked at this a few years back.

At the same time it is used for kernel testing and other stuff. It is not exactly abandonware on a CPU found in archaeological artifacts of past civilizations like ia64.

> 
> _______________________________________________
> linux-um mailing list
> linux-um@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-um
>
John Paul Adrian Glaubitz Sept. 19, 2023, 2:24 p.m. UTC | #52
On Tue, 2023-09-19 at 16:16 +0200, Peter Zijlstra wrote:
> > I find it a bit unfair that maintainers of architectures that have huge companies
> > behind them use their manpower to urge less popular architectures for removal just
> > because they don't have 150 people working on the port so they can keep up with
> > design changes quickly.
> 
> PREEMPT isn't something new. Also, I don't think the arch part for
> actually supporting it is particularly hard, mostly it is sticking the
> preempt_schedule_irq() call in return from interrupt code path.
> 
> If you convert the arch to generic-entry (a much larger undertaking)
> then you get this for free.

If the conversion isn't hard, why is the first reflex the urge to remove an architecture
instead of offering advice on how to get the conversion done?

Adrian
Matthew Wilcox Sept. 19, 2023, 2:32 p.m. UTC | #53
On Tue, Sep 19, 2023 at 04:24:48PM +0200, John Paul Adrian Glaubitz wrote:
> If the conversion isn't hard, why is the first reflex the urge to remove an architecture
> instead of offering advise how to get the conversion done?

Because PREEMPT has been around since before 2005 (cc19ca86a023 created
Kconfig.preempt and I don't need to go back further than that to make my
point), and you haven't done the work yet.  Clearly it takes the threat
of removal to get some kind of motion.
H. Peter Anvin Sept. 19, 2023, 2:50 p.m. UTC | #54
On September 19, 2023 7:17:04 AM PDT, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Tue, Sep 19 2023 at 15:48, John Paul Adrian Glaubitz wrote:
>> On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote:
>>> > The agreement to kill off ia64 wasn't an invitation to kill off other stuff
>>> > that people are still working on! Can we please not do this?
>>> 
>>> If you're working on one of them, then surely it's a simple matter of
>>> working on adding CONFIG_PREEMPT support :-)
>>
>> As Geert poined out, I'm not seeing anything particular problematic with the
>> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
>> something about organizing KConfig files.
>>
>> I find it a bit unfair that maintainers of architectures that have huge companies
>> behind them use their manpower to urge less popular architectures for removal just
>> because they don't have 150 people working on the port so they can keep up with
>> design changes quickly.
>
>I don't urge for removal. I just noticed that these four architectures
>lack PREEMPT support. The only thing which is missing is the actual
>preemption point in the return to kernel code path.
>
>But otherwise it should just work, which I obviously can't confirm :)
>
>Even without that preemption point it should build and boot. There might
>be some minor latency issues when that preemption point is not there,
>but adding it is not rocket science either. It's probably about 10 lines
>of ASM code, if at all.
>
>Though not adding that might cause a blocking issue for the rework of
>the whole preemption logic in order to remove the sprinkled around
>cond_resched() muck or force us to maintain some nasty workaround just
>for the benefit of a few stranglers.
>
>So I can make the same argument the other way around, that it's
>unjustified that some architectures which are just supported for
>nostalgia throw roadblocks into kernel developemnt.
>
>If my ALPHA foo wouldn't be very close to zero, I'd write that ASM hack
>myself, but that's going to cost more of my and your time than it's
>worth the trouble,
>
>Hmm. I could delegate that to Linus, he might still remember :)
>
>Thanks,
>
>        tglx

Does *anyone* actually run Alpha at this point?
Matt Turner Sept. 19, 2023, 2:57 p.m. UTC | #55
On Tue, Sep 19, 2023 at 10:51 AM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On September 19, 2023 7:17:04 AM PDT, Thomas Gleixner <tglx@linutronix.de> wrote:
> >On Tue, Sep 19 2023 at 15:48, John Paul Adrian Glaubitz wrote:
> >> On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote:
> >>> > The agreement to kill off ia64 wasn't an invitation to kill off other stuff
> >>> > that people are still working on! Can we please not do this?
> >>>
> >>> If you're working on one of them, then surely it's a simple matter of
> >>> working on adding CONFIG_PREEMPT support :-)
> >>
> >> As Geert poined out, I'm not seeing anything particular problematic with the
> >> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
> >> something about organizing KConfig files.
> >>
> >> I find it a bit unfair that maintainers of architectures that have huge companies
> >> behind them use their manpower to urge less popular architectures for removal just
> >> because they don't have 150 people working on the port so they can keep up with
> >> design changes quickly.
> >
> >I don't urge for removal. I just noticed that these four architectures
> >lack PREEMPT support. The only thing which is missing is the actual
> >preemption point in the return to kernel code path.
> >
> >But otherwise it should just work, which I obviously can't confirm :)
> >
> >Even without that preemption point it should build and boot. There might
> >be some minor latency issues when that preemption point is not there,
> >but adding it is not rocket science either. It's probably about 10 lines
> >of ASM code, if at all.
> >
> >Though not adding that might cause a blocking issue for the rework of
> >the whole preemption logic in order to remove the sprinkled around
> >cond_resched() muck or force us to maintain some nasty workaround just
> >for the benefit of a few stranglers.
> >
> >So I can make the same argument the other way around, that it's
> >unjustified that some architectures which are just supported for
> >nostalgia throw roadblocks into kernel developemnt.
> >
> >If my ALPHA foo wouldn't be very close to zero, I'd write that ASM hack
> >myself, but that's going to cost more of my and your time than it's
> >worth the trouble,
> >
> >Hmm. I could delegate that to Linus, he might still remember :)
> >
> >Thanks,
> >
> >        tglx
>
> Does *anyone* actually run Alpha at this point?

I do, as part of maintaining the Gentoo distribution for Alpha.

I'm listed in MAINTAINERS, but really only so I can collect patches and
send them to Linus after testing. I don't have copious amounts of free
time to be proactive in kernel development and it's also not really my
area of expertise so I'm nowhere near effective at it.

I would be happy to test any patches sent my way (but I acknowledge
that writing these patches wouldn't be high on anyone's priority list,
etc)

(A video my friend Ian and I made about a particularly large
AlphaServer I have in my basement, in case anyone is interested:
https://www.youtube.com/watch?v=z658a8Js5qg)
Thomas Gleixner Sept. 19, 2023, 3:17 p.m. UTC | #56
On Tue, Sep 19 2023 at 15:21, Anton Ivanov wrote:
> On 19/09/2023 14:42, Peter Zijlstra wrote:
>> If you're working on one of them, then surely it's a simple matter of
>> working on adding CONFIG_PREEMPT support :-)
>
> In the case of UML adding preempt will be quite difficult. I looked at
> this a few years back.

What's so difficult about it?

Thanks,

        tglx
Anton Ivanov Sept. 19, 2023, 3:21 p.m. UTC | #57
On 19/09/2023 16:17, Thomas Gleixner wrote:
> On Tue, Sep 19 2023 at 15:21, Anton Ivanov wrote:
>> On 19/09/2023 14:42, Peter Zijlstra wrote:
>>> If you're working on one of them, then surely it's a simple matter of
>>> working on adding CONFIG_PREEMPT support :-)
>> In the case of UML adding preempt will be quite difficult. I looked at
>> this a few years back.
> What's so difficult about it?

It's been a while. I remember that I dropped it at the time, but do not remember the full details.

There was some stuff related to FP state and a few other issues I ran into while rewriting the interrupt controller. Some of it may be resolved by now as we are using host cpu flags, etc.

I can give it another go :)

>
> Thanks,
>
>          tglx
>
Steven Rostedt Sept. 19, 2023, 3:31 p.m. UTC | #58
On Tue, 19 Sep 2023 15:32:05 +0100
Matthew Wilcox <willy@infradead.org> wrote:

> On Tue, Sep 19, 2023 at 04:24:48PM +0200, John Paul Adrian Glaubitz wrote:
> > If the conversion isn't hard, why is the first reflex the urge to remove an architecture
> > instead of offering advise how to get the conversion done?  
> 
> Because PREEMPT has been around since before 2005 (cc19ca86a023 created
> Kconfig.preempt and I don't need to go back further than that to make my
> point), and you haven't done the work yet.  Clearly it takes the threat
> of removal to get some kind of motion.

Or the use case of a preempt kernel on said arch has never been a request.
Just because it was available doesn't necessarily mean it's required.

Please, let's not jump to threats of removal just to get a feature in.
Simply ask first. I didn't see anyone reaching out to the maintainers
asking for this as it will be needed for a new feature that will likely
make maintaining said arch easier.

Everything is still in brainstorming mode.

-- Steve
Richard Weinberger Sept. 19, 2023, 4:22 p.m. UTC | #59
----- Original Message -----
> From: "anton ivanov" <anton.ivanov@cambridgegreys.com>
> It's been a while. I remember that I dropped it at the time, but do not remember
> the full details.
> 
> There was some stuff related to FP state and a few other issues I ran into while
> rewriting the interrupt controller. Some of it may be resolved by now as we are
> using host cpu flags, etc.

I remember also having a hacky but working version almost 10 years ago.
It was horribly slow because of the extra scheduler rounds.
But yes, if PREEMPT will be a must-have feature we'll have to try again.

Thanks,
//richard
Anton Ivanov Sept. 19, 2023, 4:41 p.m. UTC | #60
On 19/09/2023 17:22, Richard Weinberger wrote:
> ----- Original Message -----
>> From: "anton ivanov" <anton.ivanov@cambridgegreys.com>
>> It's been a while. I remember that I dropped it at the time, but do not remember
>> the full details.
>>
>> There was some stuff related to FP state and a few other issues I ran into while
>> rewriting the interrupt controller. Some of it may be resolved by now as we are
>> using host cpu flags, etc.
> 
> I remember also having a hacky but working version almost 10 years ago.
> It was horribly slow because of the extra scheduler rounds.
> But yes, if PREEMPT will be a must-have feature we'll have to try again.

We will need proper fpu primitives for starters that's for sure. fpu_start/end in UML are presently NOOPs.
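Roughly what non-NOOP primitives would have to provide (a sketch only, modelled on the x86 kernel_fpu_begin()/kernel_fpu_end() pattern; um_fpu_begin/end and the host-state handling are made-up placeholders):

	static inline void um_fpu_begin(void)
	{
		preempt_disable();	/* FPU state must not be preempted away */
		/* acquire/save the host FPU state here */
	}

	static inline void um_fpu_end(void)
	{
		/* release/restore the host FPU state here */
		preempt_enable();
	}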

Some of the default spinlocks and other stuff which we pick up from generic may need to change as well.

This is off the top of my head and something which we can fix straight away. I will send some patches to the mailing list tomorrow or on Thu.

A.

> 
> Thanks,
> //richard
> 
Ulrich Teichert Sept. 19, 2023, 5:09 p.m. UTC | #61
Hi,

[del]
> Does *anyone* actually run Alpha at this point?

Yes, at least I'm still trying to keep my boxes running from time to time,

CU,
Uli
Linus Torvalds Sept. 19, 2023, 5:25 p.m. UTC | #62
On Tue, 19 Sept 2023 at 06:48, John Paul Adrian Glaubitz
<glaubitz@physik.fu-berlin.de> wrote:
>
> As Geert pointed out, I'm not seeing anything particularly problematic with the
> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
> something about organizing KConfig files.

It can definitely be problematic.

Not the Kconfig file part, and not the preempt count part itself.

But the fact that it has never been used and tested means that there
might be tons of "this architecture code knows it's not preemptible,
because this architecture doesn't support preemption".

So you may have basic architecture code that simply doesn't have the
"preempt_disable()/enable()" pairs that it needs.

PeterZ mentioned the generic entry code, which does this for the entry
path. But it actually goes much deeper: just do a

    git grep preempt_disable arch/x86/kernel

and then do the same for some other architectures.

Looking at alpha, for example, there *are* hits for it, so at least
some of the code there clearly *tries* to do it. But does it cover all
the required parts? If it's never been tested, I'd be surprised if
it's all just ready to go.

I do think we'd need to basically continue to support ARCH_NO_PREEMPT
- and such architectures might end up with the worst-case latencies of
only scheduling at return to user space.

               Linus
Thomas Gleixner Sept. 19, 2023, 5:33 p.m. UTC | #63
On Tue, Sep 19 2023 at 17:41, Anton Ivanov wrote:
> On 19/09/2023 17:22, Richard Weinberger wrote:
>> ----- Original Message -----
>>> From: "anton ivanov" <anton.ivanov@cambridgegreys.com>
>>> It's been a while. I remember that I dropped it at the time, but do not remember
>>> the full details.
>>>
>>> There was some stuff related to FP state and a few other issues I ran into while
>>> rewriting the interrupt controller. Some of it may be resolved by now as we are
>>> using host cpu flags, etc.
>> 
>> I remember also having a hacky but working version almost 10 years ago.
>> It was horribly slow because of the extra scheduler rounds.

Which can be completely avoided as the proposed change will have the
preemption points, but they are only utilized when preempt FULL is
enabled (at boot or runtime). So the behaviour can still be like preempt
NONE, but with a twist to get rid of the cond_resched()/might_resched()
and other heuristic approaches to prevent starvation by long running
functions. That twist needs the preemption points.

See https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx

>> But yes, if PREEMPT will be a must-have feature we'll have to try again.
>
> We will need proper fpu primitives for starters that's for
> sure. fpu_start/end in UML are presently NOOPs.
>
> Some of the default spinlocks and other stuff which we pick up from
> generic may need to change as well.
>
> This is off the top of my head and something which we can fix straight
> away. I will send some patches to the mailing list tomorrow or on Thu.

I think it does not have to be perfect. UM is far from perfect in
mimicking a real kernel. The main point is that it provides the preempt
counter in the first place and some minimal amount of preemption points
aside of those which come with the preempt_enable() machinery for free.

Thanks,

        tglx
John Paul Adrian Glaubitz Sept. 19, 2023, 5:58 p.m. UTC | #64
Hi Linus!

On Tue, 2023-09-19 at 10:25 -0700, Linus Torvalds wrote:
> On Tue, 19 Sept 2023 at 06:48, John Paul Adrian Glaubitz
> <glaubitz@physik.fu-berlin.de> wrote:
> > 
> > > As Geert pointed out, I'm not seeing anything particularly problematic with the
> > architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
> > something about organizing KConfig files.
> 
> It can definitely be problematic.
> 
> Not the Kconfig file part, and not the preempt count part itself.
> 
> But the fact that it has never been used and tested means that there
> might be tons of "this architecture code knows it's not preemptible,
> because this architecture doesn't support preemption".
> 
> So you may have basic architecture code that simply doesn't have the
> "preempt_disable()/enable()" pairs that it needs.
> 
> PeterZ mentioned the generic entry code, which does this for the entry
> path. But it actually goes much deeper: just do a
> 
>     git grep preempt_disable arch/x86/kernel
> 
> and then do the same for some other architectures.
> 
> Looking at alpha, for example, there *are* hits for it, so at least
> some of the code there clearly *tries* to do it. But does it cover all
> the required parts? If it's never been tested, I'd be surprised if
> it's all just ready to go.

Thanks for the detailed explanation.

> I do think we'd need to basically continue to support ARCH_NO_PREEMPT
> - and such architectures might end up with the worst-case latencies of
> only scheduling at return to user space.

Great to hear, thank you.

And, yes, eventually I would be happy to help get alpha and m68k converted.

Adrian
Thomas Gleixner Sept. 19, 2023, 6:31 p.m. UTC | #65
On Tue, Sep 19 2023 at 10:25, Linus Torvalds wrote:
> On Tue, 19 Sept 2023 at 06:48, John Paul Adrian Glaubitz
> <glaubitz@physik.fu-berlin.de> wrote:
>>
>> As Geert pointed out, I'm not seeing anything particularly problematic with the
>> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
>> something about organizing KConfig files.
>
> It can definitely be problematic.
>
> Not the Kconfig file part, and not the preempt count part itself.
>
> But the fact that it has never been used and tested means that there
> might be tons of "this architecture code knows it's not preemptible,
> because this architecture doesn't support preemption".
>
> So you may have basic architecture code that simply doesn't have the
> "preempt_disable()/enable()" pairs that it needs.
>
> PeterZ mentioned the generic entry code, which does this for the entry
> path. But it actually goes much deeper: just do a
>
>     git grep preempt_disable arch/x86/kernel
>
> and then do the same for some other architectures.
>
> Looking at alpha, for example, there *are* hits for it, so at least
> some of the code there clearly *tries* to do it. But does it cover all
> the required parts? If it's never been tested, I'd be surprised if
> it's all just ready to go.
>
> I do think we'd need to basically continue to support ARCH_NO_PREEMPT
> - and such architectures might end up with the worst-case latencies of
> only scheduling at return to user space.

The only thing these architectures should gain is the preempt counter
itself, but yes the extra preemption points are not mandatory to
have, i.e. we simply do not enable them for the nostalgia club.

The removal of cond_resched() might cause latencies, but then I doubt
that these museum pieces are used for real work :)

Thanks,

        tglx
Steven Rostedt Sept. 19, 2023, 6:38 p.m. UTC | #66
On Tue, 19 Sep 2023 20:31:50 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> The removal of cond_resched() might cause latencies, but then I doubt
> that these museum pieces are used for real work :)

We could simply leave the cond_resched() around but defined as nops for
everything but the "nostalgia club" to keep them from having any regressions.
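Something like the following would do it (just a sketch of the idea; the
real cond_resched() definitions are more tangled because of PREEMPT_DYNAMIC):

	/* Sketch: keep the call sites, compile them out everywhere
	 * except the ARCH_NO_PREEMPT architectures. */
	#ifdef CONFIG_ARCH_NO_PREEMPT
	int __cond_resched(void);
	# define cond_resched()	__cond_resched()
	#else
	static inline int cond_resched(void)
	{
		return 0;	/* the preemption points do the work instead */
	}
	#endif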

-- Steve
Linus Torvalds Sept. 19, 2023, 6:52 p.m. UTC | #67
On Tue, 19 Sept 2023 at 11:37, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> We could simply leave the cond_resched() around but defined as nops for
> everything but the "nostalgia club" to keep them from having any regressions.

I doubt the nostalgia club cares about some latencies (that are
usually only noticeable under extreme loads anyway).

And if they do, maybe that would make somebody sit down and look into
doing it right.

So I think keeping it around would actually be both useless and
counter-productive.

              Linus
Ankur Arora Sept. 19, 2023, 7:05 p.m. UTC | #68
Thomas Gleixner <tglx@linutronix.de> writes:

> On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote:
>> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>>> > The problem with the REP prefix (and Xen hypercalls) is that
>>> > they're long running instructions and it becomes fundamentally
>>> > impossible to put a cond_resched() in.
>>> >
>>> >> Yes. I'm starting to think that that the only sane solution is to
>>> >> limit cases that can do this a lot, and the "instruciton pointer
>>> >> region" approach would certainly work.
>>> >
>>> > From a code locality / I-cache POV, I think a sorted list of
>>> > (non overlapping) ranges might be best.
>>>
>>> Yeah, agreed. There are a few problems with doing that though.
>>>
>>> I was thinking of using a check of this kind to schedule out when
>>> it is executing in this "reschedulable" section:
>>>         !preempt_count() && in_resched_function(regs->rip);
>>>
>>> For preemption=full, this should mostly work.
>>> For preemption=voluntary, though this'll only work with out-of-line
>>> locks, not if the lock is inlined.
>>>
>>> (Both, should have problems with __this_cpu_* and the like, but
>>> maybe we can handwave that away with sparse/objtool etc.)
>>
>> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
>> thing, and then only search the range when TIF flag is set.
>>
>> And I'm thinking it might be a good idea to have objtool validate the
>> range only contains simple instructions, the moment it contains control
>> flow I'm thinking it's too complicated.
>
> Can we take a step back and look at the problem from a scheduling
> perspective?
>
> The basic operation of a non-preemptible kernel is time slice
> scheduling, which means that a task can run more or less undisturbed for
> a full time slice once it gets on the CPU unless it schedules away
> voluntary via a blocking operation.
>
> This works pretty well as long as everything runs in userspace as the
> preemption points in the return to user space path are independent of
> the preemption model.
>
> These preemption points handle both time slice exhaustion and priority
> based preemption.
>
> With PREEMPT=NONE these are the only available preemption points.
>
> That means that kernel code can run more or less indefinitely until it
> schedules out or returns to user space, which is obviously not possible
> for kernel threads.
>
> To prevent starvation the kernel gained voluntary preemption points,
> i.e. cond_resched(), which has to be added manually to code as a
> developer sees fit.
>
> Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as
> additional preemption points. might_resched() utilizes the existing
> might_sleep() debug points, which are in code paths which might block on
> a contended resource. These debug points are mostly in core and
> infrastructure code and are in code paths which can block anyway. The
> only difference is that they allow preemption even when the resource is
> uncontended.
>
> Additionally we have PREEMPT=FULL which utilizes every zero transition
> of preempt_count as a potential preemption point.
>
> Now we have the situation of long running data copies or data clear
> operations which run fully in hardware, but can be interrupted. As the
> interrupt return to kernel mode does not preempt in the NONE and
> VOLUNTARY cases, new workarounds emerged. Mostly by defining a data
> chunk size and adding cond_resched() again.
>
> That's ugly and does not work for long lasting hardware operations so we
> ended up with the suggestion of TIF_ALLOW_RESCHED to work around
> that. But again this needs to be manually annotated in the same way as an
> IP range based preemption scheme requires annotation.
>
> TBH. I detest all of this.
>
> Both cond_resched() and might_sleep/sched() are completely random
> mechanisms as seen from time slice operation and the data chunk based
> mechanism is just heuristics which works as good as heuristics tend to
> work. allow_resched() is not any different and IP based preemption
> mechanism are not going to be any better.

Agreed. I was looking at how to add resched sections etc, and in
addition to the randomness the choice of where exactly to add it seemed
to be quite fuzzy. A recipe for future cruft.

> The approach here is: Prevent the scheduler to make decisions and then
> mitigate the fallout with heuristics.
>
> That's just backwards as it moves resource control out of the scheduler
> into random code which has absolutely no business to do resource
> control.
>
> We have the reverse issue observed in PREEMPT_RT. The fact that spinlock
> held sections became preemptible caused even more preemption activity
> than on a PREEMPT=FULL kernel. The worst side effect of that was
> extensive lock contention.
>
> The way how we addressed that was to add a lazy preemption mode, which
> tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to
> preempt tasks which all belong to the SCHED_OTHER scheduling class. This
> works pretty well and gains back a massive amount of performance for the
> non-realtime throughput oriented tasks without affecting the
> schedulability of real-time tasks at all. IOW, it does not take control
> away from the scheduler. It cooperates with the scheduler and leaves the
> ultimate decisions to it.
>
> I think we can do something similar for the problem at hand, which
> avoids most of these heuristic horrors and control boundary violations.
>
> The main issue is that long running operations do not honour the time
> slice and we work around that with cond_resched() and now have ideas
> with this new TIF bit and IP ranges.
>
> None of that is really well defined in respect to time slices. In fact
> its not defined at all versus any aspect of scheduling behaviour.
>
> What about the following:
>
>    1) Keep preemption count and the real preemption points enabled
>       unconditionally. That's not more overhead than the current
>       PREEMPT_DYNAMIC mechanism as long as the preemption count does not
>       go to zero, i.e. the folded NEED_RESCHED bit stays set.
>
>       From earlier experiments I know that the overhead of preempt_count
>       is minimal and only really observable with micro benchmarks.
>       Otherwise it ends up in the noise as long as the slow path is not
>       taken.
>
>       I did a quick check comparing a plain inc/dec pair vs. the
>       PREEMPT_DYNAMIC inc/dec_and_test+NOOP mechanism and the delta is
>       in the non-conclusive noise.
>
>       20 years ago this was a real issue because we did not have:
>
>        - the folding of NEED_RESCHED into the preempt count
>
>        - the cacheline optimizations which make the preempt count cache
>          pretty much always cache hot
>
>        - the hardware was way less capable
>
>       I'm not saying that preempt_count is completely free today as it
>       obviously adds more text and affects branch predictors, but as the
>       major distros ship with PREEMPT_DYNAMIC enabled it is obviously an
>       acceptable and tolerable tradeoff.
>
>    2) When the scheduler wants to set NEED_RESCHED, it sets
>       NEED_RESCHED_LAZY instead which is only evaluated in the return to
>       user space preemption points.
>
>       As NEED_RESCHED_LAZY is not folded into the preemption count the
>       preemption count won't become zero, so the task can continue until
>       it hits return to user space.
>
>       That preserves the existing behaviour.
>
>    3) When the scheduler tick observes that the time slice is exhausted,
>       then it folds the NEED_RESCHED bit into the preempt count which
>       causes the real preemption points to actually preempt including
>       the return from interrupt to kernel path.

Right, and currently we check cond_resched() all the time in expectation
that something might need a resched.

Folding it in with the scheduler determining when next preemption happens
seems to make a lot of sense to me.
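For illustration, point 3 might end up looking roughly like this (a
sketch only; slice_exhausted() stands in for whatever the tick actually
uses, and the local fold is legitimate because the tick runs on the CPU
that owns rq->curr):

	static void tick_enforce_slice(struct rq *rq)
	{
		struct task_struct *curr = rq->curr;

		if (!slice_exhausted(curr))	/* hypothetical helper */
			return;

		set_tsk_need_resched(curr);	/* TIF_NEED_RESCHED */
		set_preempt_need_resched();	/* fold into preempt_count */

		/*
		 * From here on every preempt_enable() that hits zero and the
		 * return-from-interrupt path will preempt, not just the
		 * return-to-user path.
		 */
	}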


Thanks
Ankur

>       That even allows the scheduler to enforce preemption for e.g. RT
>       class tasks without changing anything else.
>
>       I'm pretty sure that this gets rid of cond_resched(), which is an
>       impressive list of instances:
>
> 	./drivers	 392
> 	./fs		 318
> 	./mm		 189
> 	./kernel	 184
> 	./arch		  95
> 	./net		  83
> 	./include	  46
> 	./lib		  36
> 	./crypto	  16
> 	./sound		  16
> 	./block		  11
> 	./io_uring	  13
> 	./security	  11
> 	./ipc		   3
>
>       That list clearly documents that the majority of these
>       cond_resched() invocations is in code which neither should care
>       nor should have any influence on the core scheduling decision
>       machinery.
>
> I think it's worth a try as it just fits into the existing preemption
> scheme, solves the issue of long running kernel functions, prevents
> invalid preemption and can utilize the existing instrumentation and
> debug infrastructure.
>
> Most importantly it gives control back to the scheduler and does not
> make it depend on the mercy of cond_resched(), allow_resched() or
> whatever heuristics sprinkled all over the kernel.

> To me this makes a lot of sense, but I might be on the completely wrong
> track. So feel free to tell me that I'm completely nuts and/or just not
> seeing the obvious.
>
> Thanks,
>
>         tglx


--
ankur
Thomas Gleixner Sept. 19, 2023, 7:53 p.m. UTC | #69
On Tue, Sep 19 2023 at 11:52, Linus Torvalds wrote:
> On Tue, 19 Sept 2023 at 11:37, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> We could simply leave the cond_resched() around but defined as nops for
>> everything but the "nostalgia club" to keep them from having any regressions.
>
> I doubt the nostalgia club cares about some latencies (that are
> usually only noticeable under extreme loads anyway).
>
> And if they do, maybe that would make somebody sit down and look into
> doing it right.
>
> So I think keeping it around would actually be both useless and
> counter-productive.

Amen to that.
Ingo Molnar Sept. 20, 2023, 7:29 a.m. UTC | #70
* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Tue, Sep 19 2023 at 10:25, Linus Torvalds wrote:
> > On Tue, 19 Sept 2023 at 06:48, John Paul Adrian Glaubitz
> > <glaubitz@physik.fu-berlin.de> wrote:
> >>
> >> As Geert pointed out, I'm not seeing anything particularly problematic with the
> >> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
> >> something about organizing KConfig files.
> >
> > It can definitely be problematic.
> >
> > Not the Kconfig file part, and not the preempt count part itself.
> >
> > But the fact that it has never been used and tested means that there
> > might be tons of "this architecture code knows it's not preemptible,
> > because this architecture doesn't support preemption".
> >
> > So you may have basic architecture code that simply doesn't have the
> > "preempt_disable()/enable()" pairs that it needs.
> >
> > PeterZ mentioned the generic entry code, which does this for the entry
> > path. But it actually goes much deeper: just do a
> >
> >     git grep preempt_disable arch/x86/kernel
> >
> > and then do the same for some other architectures.
> >
> > Looking at alpha, for example, there *are* hits for it, so at least
> > some of the code there clearly *tries* to do it. But does it cover all
> > the required parts? If it's never been tested, I'd be surprised if
> > it's all just ready to go.
> >
> > I do think we'd need to basically continue to support ARCH_NO_PREEMPT
> > - and such architectures might end up with the worst-case latencies of
> > only scheduling at return to user space.
> 
> The only thing these architectures should gain is the preempt counter 
> itself, [...]

And if any of these machines are still used, there's the small benefit of 
preempt_count increasing debuggability of scheduling in supposedly 
preempt-off sections that were ignored silently previously, as most of 
these architectures do not even enable CONFIG_DEBUG_ATOMIC_SLEEP=y in their 
defconfigs:

  $ for ARCH in alpha hexagon m68k um; do git grep DEBUG_ATOMIC_SLEEP arch/$ARCH; done
  $

Plus the efficiency of CONFIG_DEBUG_ATOMIC_SLEEP=y is much reduced on 
non-PREEMPT kernels to begin with: it will basically only detect scheduling 
in hardirqs-off critical sections.

So IMHO there's a distinct debuggability & robustness plus in enabling the 
preemption count on all architectures, even if they don't or cannot use the 
rescheduling points.

> [...] but yes the extra preemption points are not mandatory to have, i.e. 
> we simply do not enable them for the nostalgia club.
> 
> The removal of cond_resched() might cause latencies, but then I doubt 
> that these museum pieces are used for real work :)

I'm not sure we should initially remove *explicit* legacy cond_resched() 
points, except from high-freq paths where they hurt - and of course remove 
them from might_sleep().

Thanks,

	Ingo
Ingo Molnar Sept. 20, 2023, 7:32 a.m. UTC | #71
* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 19 Sep 2023 20:31:50 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > The removal of cond_resched() might cause latencies, but then I doubt
> > that these museum pieces are used for real work :)
> 
> We could simply leave the cond_resched() around but defined as nops for
> everything but the "nostalgia club" to keep them from having any regressions.

That's not a good idea IMO, it's an invitation for bitrot at an accelerated rate,
turning cond_resched() meaningless very quickly.

We should remove cond_resched() - but probably not as the first step. They 
are conceptually independent of NEED_RESCHED_LAZY and we don't *have to* 
remove them straight away.

By removing cond_resched() separately there's an easily bisectable point to 
blame for any longer latencies on legacy platforms, should any of them 
still be used with recent kernels.

Thanks,

	Ingo
Thomas Gleixner Sept. 20, 2023, 8:26 a.m. UTC | #72
On Tue, Sep 19 2023 at 10:25, Linus Torvalds wrote:
> PeterZ mentioned the generic entry code, which does this for the entry
> path. But it actually goes much deeper: just do a
>
>     git grep preempt_disable arch/x86/kernel
>
> and then do the same for some other architectures.
>
> Looking at alpha, for example, there *are* hits for it, so at least
> some of the code there clearly *tries* to do it. But does it cover all
> the required parts? If it's never been tested, I'd be surprised if
> it's all just ready to go.

Interestingly enough m68k has zero instances, but it supports PREEMPT on
the COLDFIRE subarchitecture...
David Laight Sept. 20, 2023, 10:37 a.m. UTC | #73
From: Linus Torvalds
> Sent: 19 September 2023 18:25
> 
> On Tue, 19 Sept 2023 at 06:48, John Paul Adrian Glaubitz
> <glaubitz@physik.fu-berlin.de> wrote:
> >
> > As Geert pointed out, I'm not seeing anything particularly problematic with the
> > architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
> > something about organizing KConfig files.
> 
> It can definitely be problematic.
> 
> Not the Kconfig file part, and not the preempt count part itself.
> 
> But the fact that it has never been used and tested means that there
> might be tons of "this architecture code knows it's not preemptible,
> because this architecture doesn't support preemption".

Do distros even build x86 kernels with PREEMPT_FULL?
I know I've had issues with massive latencies caused by a graphics driver
forcing write-backs of all the framebuffer memory.
(I think it is a failed attempt to fix a temporary display corruption.)

OTOH SMP support and CONFIG_PREEMPT_RT will test most of the code.

	David

Ankur Arora Sept. 20, 2023, 2:22 p.m. UTC | #74
Thomas Gleixner <tglx@linutronix.de> writes:

> So the decision matrix would be:
>
>                 Ret2user        Ret2kernel      PreemptCnt=0
>
> NEED_RESCHED       Y                Y               Y
> LAZY_RESCHED       Y                N               N
>
> That is completely independent of the preemption model and the
> differentiation of the preemption models happens solely at the scheduler
> level:

This is relatively minor, but do we need two flags? Seems to me we
can get to the same decision matrix by letting the scheduler fold
into the preempt-count based on current preemption model.

> PREEMPT_NONE sets only LAZY_RESCHED unless it needs to enforce the time
> slice where it sets NEED_RESCHED.

PREEMPT_NONE sets only TIF_NEED_RESCHED. For the time-slice expiry case,
it is also folded into the preempt count.

> PREEMPT_VOLUNTARY extends the NONE model so that the wakeup of RT class
> tasks or sporadic event tasks sets NEED_RESCHED too.

PREEMPT_VOLUNTARY sets TIF_NEED_RESCHED and also folds it for the
RT/sporadic tasks.

> PREEMPT_FULL always sets NEED_RESCHED like today.

Always fold the TIF_NEED_RESCHED into the preempt-count.
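A minimal sketch of that single-flag variant, as seen from a preemption
point (preempt_model_full() is the existing accessor; is_exit_to_user()
is purely illustrative):

	static bool should_preempt_now(void)
	{
		if (!tif_need_resched())
			return false;

		if (preempt_model_full())
			return true;	/* honour it at every preemption point */

		/* NONE/VOLUNTARY: only honour it on the way out to user space */
		return is_exit_to_user();	/* illustrative only */
	}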

> We should be able merge the PREEMPT_NONE/VOLUNTARY behaviour so that we
> only end up with two variants or even subsume PREEMPT_FULL into that
> model because that's what is closer to the RT LAZY preempt behaviour,
> which has two goals:
>
>       1) Make low latency guarantees for RT workloads
>
>       2) Preserve the throughput for non-RT workloads
>
> But in any case this decision happens solely in the core scheduler code
> and nothing outside of it needs to be changed.
>
> So we not only get rid of the cond/might_resched() muck, we also get rid
> of the static_call/static_key machinery which drives PREEMPT_DYNAMIC.
> The only place which still needs that runtime tweaking is the scheduler
> itself.

True. The dynamic preemption could just become a scheduler tunable.

> Though it just occured to me that there are dragons lurking:
>
> arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
> arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
> arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
> arch/um/Kconfig:        select ARCH_NO_PREEMPT
>
> So we have four architectures which refuse to enable preemption points,
> i.e. the only model they allow is NONE and they rely on cond_resched()
> for breaking large computations.
>
> But they support PREEMPT_COUNT, so we might get away with a reduced
> preemption point coverage:
>
>                 Ret2user        Ret2kernel      PreemptCnt=0
>
> NEED_RESCHED       Y                N               Y
> LAZY_RESCHED       Y                N               N

So from the discussion in the other thread, for the ARCH_NO_PREEMPT
configs that don't support preemption, we probably need a fourth
preemption model, say PREEMPT_UNSAFE.

These could use only the Ret2user preemption points and just fallback
to the !PREEMPT_COUNT primitives.

Thanks

--
ankur
Anton Ivanov Sept. 20, 2023, 2:38 p.m. UTC | #75
On 19/09/2023 15:16, Peter Zijlstra wrote:
> On Tue, Sep 19, 2023 at 03:48:09PM +0200, John Paul Adrian Glaubitz wrote:
>> On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote:
>>>> The agreement to kill off ia64 wasn't an invitation to kill off other stuff
>>>> that people are still working on! Can we please not do this?
>>>
>>> If you're working on one of them, then surely it's a simple matter of
>>> working on adding CONFIG_PREEMPT support :-)
>>
>> As Geert pointed out, I'm not seeing anything particularly problematic with the
>> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
>> something about organizing KConfig files.
> 
> The plan in the parent thread is to remove PREEMPT_NONE and
> PREEMPT_VOLUNTARY and only keep PREEMPT_FULL.
> 
>> I find it a bit unfair that maintainers of architectures that have huge companies
>> behind them use their manpower to urge less popular architectures for removal just
>> because they don't have 150 people working on the port so they can keep up with
>> design changes quickly.
> 
> PREEMPT isn't something new. Also, I don't think the arch part for
> actually supporting it is particularly hard, mostly it is sticking the
> preempt_schedule_irq() call in return from interrupt code path.

That calls local_irq_enable() which does various signal-related/IRQ-pending work on UML. That in turn does not like being invoked again (as you may already have been invoked out of it) in the IRQ return path.

So it is likely to end up being slightly more difficult than that for UML - it will need to be wrapped so it can be invoked from the "host" side signal code as well as invoked with some additional checks to avoid making a hash out of the IRQ handling.

It may be necessary to modify some of the existing reentrancy prevention logic in the signal handlers as well and change it to make use of the preempt count instead of its own flags/counters.
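For reference, the preemption point Peter refers to is essentially the
following in the generic entry code (a lightly simplified sketch of
irqentry_exit_cond_resched() from kernel/entry/common.c; the
UML-specific signal wrapping described above is not shown):

	void irqentry_exit_cond_resched(void)
	{
		if (!preempt_count()) {
			/* Sanity check RCU and the thread stack */
			rcu_irq_exit_check_preempt();
			if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
				WARN_ON_ONCE(!on_thread_stack());
			if (need_resched())
				preempt_schedule_irq();
		}
	}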

> 
> If you convert the arch to generic-entry (a much larger undertaking)
> then you get this for free.
> 
Thomas Gleixner Sept. 20, 2023, 8:51 p.m. UTC | #76
On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
>
>> So the decision matrix would be:
>>
>>                 Ret2user        Ret2kernel      PreemptCnt=0
>>
>> NEED_RESCHED       Y                Y               Y
>> LAZY_RESCHED       Y                N               N
>>
>> That is completely independent of the preemption model and the
>> differentiation of the preemption models happens solely at the scheduler
>> level:
>
> This is relatively minor, but do we need two flags? Seems to me we
> can get to the same decision matrix by letting the scheduler fold
> into the preempt-count based on current preemption model.

You still need the TIF flags because there is no way to do remote
modification of preempt count.

The preempt count folding is an optimization which simplifies the
preempt_enable logic:

	if (--preempt_count && need_resched())
		schedule()
to
	if (--preempt_count)
		schedule()

i.e. a single conditional instead of two.
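(For context, the folding trick itself: NEED_RESCHED is stored as an
inverted bit inside the per-CPU preempt count on x86, so a single
dec-and-test answers "count hit zero and a reschedule is needed". A
simplified, non-per-CPU sketch:)

	#define PREEMPT_NEED_RESCHED	0x80000000U	/* stored inverted */

	static inline void set_preempt_need_resched_sketch(unsigned int *pc)
	{
		*pc &= ~PREEMPT_NEED_RESCHED;	/* cleared bit == resched needed */
	}

	static inline bool preempt_count_dec_and_test_sketch(unsigned int *pc)
	{
		/* reaches zero only if the count is zero *and* a resched is needed */
		return --(*pc) == 0;
	}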

The lazy bit is only evaluated in:

    1) The return to user path

    2) need_reched()

In neither case preempt_count is involved.

So it does not buy us anything. We might revisit that later, but for
simplicity sake the extra TIF bit is way simpler.

Premature optimization is the enemy of correctness.

>> We should be able merge the PREEMPT_NONE/VOLUNTARY behaviour so that we
>> only end up with two variants or even subsume PREEMPT_FULL into that
>> model because that's what is closer to the RT LAZY preempt behaviour,
>> which has two goals:
>>
>>       1) Make low latency guarantees for RT workloads
>>
>>       2) Preserve the throughput for non-RT workloads
>>
>> But in any case this decision happens solely in the core scheduler code
>> and nothing outside of it needs to be changed.
>>
>> So we not only get rid of the cond/might_resched() muck, we also get rid
>> of the static_call/static_key machinery which drives PREEMPT_DYNAMIC.
>> The only place which still needs that runtime tweaking is the scheduler
>> itself.
>
> True. The dynamic preemption could just become a scheduler tunable.

That's the point.

>> But they support PREEMPT_COUNT, so we might get away with a reduced
>> preemption point coverage:
>>
>>                 Ret2user        Ret2kernel      PreemptCnt=0
>>
>> NEED_RESCHED       Y                N               Y
>> LAZY_RESCHED       Y                N               N
>
> So from the discussion in the other thread, for the ARCH_NO_PREEMPT
> configs that don't support preemption, we probably need a fourth
> preemption model, say PREEMPT_UNSAFE.

As discussed they won't really notice the latency issues because the
museum pieces are not used for anything crucial and for UM that's the
least of the correctness worries.

So no, we don't need yet another knob. We keep them chucking along and
if they really want they can adapt to the new world order. :)

Thanks,

        tglx
Thomas Gleixner Sept. 20, 2023, 11:58 p.m. UTC | #77
On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
>> Anyway, I'm definitely not opposed. We'd get rid of a config option
>> that is presumably not very widely used, and we'd simplify a lot of
>> issues, and get rid of all these badly defined "cond_preempt()"
>> things.
>
> Hmm. Didn't I promise a year ago that I won't do further large scale
> cleanups and simplifications beyond printk.
>
> Maybe I get away this time with just suggesting it. :)

Maybe not. As I'm inveterately curious, I sat down and figured out what
that might look like.

To some extent I really curse my curiosity as the amount of macro maze,
config options and convoluted mess behind all these preempt mechanisms
is beyond disgusting.

Find below a PoC which implements that scheme. It's not even close to
correct, but it builds, boots and survives lightweight testing.

I did not even try to look into time-slice enforcement, but I really want
to share this for illustration and for others to experiment.

This keeps all the existing mechanisms in place and introduces a new
config knob in the preemption model Kconfig switch: PREEMPT_AUTO

If selected it builds a CONFIG_PREEMPT kernel, which disables the
cond_resched() machinery and switches the fair scheduler class to use
the NEED_RESCHED_LAZY bit by default, i.e. it should be pretty close to
the preempt NONE model except that cond_resched() is a NOOP and I did
not validate the time-slice enforcement. The latter should be a
no-brainer to figure out and fix if required.

To switch this at run-time to the FULL preemption model, which always
uses TIF_NEED_RESCHED, you need to enable CONFIG_SCHED_DEBUG; then
you can enable "FULL" via:

  echo FORCE_NEED_RESCHED >/sys/kernel/debug/sched/features

and switch back to some sort of "NONE" via

  echo NO_FORCE_NEED_RESCHED >/sys/kernel/debug/sched/features

It seems to work as expected for a simple hackbench -l 10000 run:

                  NO_FORCE_NEED_RESCHED    FORCE_NEED_RESCHED
schedule() [1]                  3646163               2701641
preemption                        12554                927856
total                           3658717               3629497

[1] is voluntary schedule() _and_ schedule() from return to user space. I
did not get around to accounting them separately yet, but for a quick
check this clearly shows that it "works" as advertised.

Of course this needs way more analysis than this quick PoC+check, but
you get the idea.

Contrary to other hot-off-the-press hacks, I'm pretty sure it won't
destroy your hard-disk, but I won't recommend that you deploy it on your
alarm-clock as it might make you miss the bus.

If this concept holds, which I'm pretty convinced of by now, then this
is an opportunity to trade ~3000 lines of unholy hacks for about 100-200
lines of understandable code :)

Thanks,

        tglx
---
 arch/x86/Kconfig                   |    1 
 arch/x86/include/asm/thread_info.h |    2 +
 drivers/acpi/processor_idle.c      |    2 -
 include/linux/entry-common.h       |    2 -
 include/linux/entry-kvm.h          |    2 -
 include/linux/sched.h              |   18 +++++++++++-----
 include/linux/sched/idle.h         |    8 +++----
 include/linux/thread_info.h        |   19 +++++++++++++++++
 kernel/Kconfig.preempt             |   12 +++++++++-
 kernel/entry/common.c              |    2 -
 kernel/sched/core.c                |   41 ++++++++++++++++++++++++-------------
 kernel/sched/fair.c                |   10 ++++-----
 kernel/sched/features.h            |    2 +
 kernel/sched/idle.c                |    3 --
 kernel/sched/sched.h               |    1 
 kernel/trace/trace.c               |    2 -
 16 files changed, 91 insertions(+), 36 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -898,14 +898,14 @@ static inline void hrtick_rq_init(struct
 
 #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
 /*
- * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
+ * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
  * this avoids any races wrt polling state changes and thereby avoids
  * spurious IPIs.
  */
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int nr_bit)
 {
 	struct thread_info *ti = task_thread_info(p);
-	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+	return !(fetch_or(&ti->flags, 1 << nr_bit) & _TIF_POLLING_NRFLAG);
 }
 
 /*
@@ -931,9 +931,9 @@ static bool set_nr_if_polling(struct tas
 }
 
 #else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int nr_bit)
 {
-	set_tsk_need_resched(p);
+	set_tsk_thread_flag(p, nr_bit);
 	return true;
 }
 
@@ -1038,28 +1038,42 @@ void wake_up_q(struct wake_q_head *head)
  * might also involve a cross-CPU call to trigger the scheduler on
  * the target CPU.
  */
-void resched_curr(struct rq *rq)
+static void __resched_curr(struct rq *rq, int nr_bit)
 {
 	struct task_struct *curr = rq->curr;
 	int cpu;
 
 	lockdep_assert_rq_held(rq);
 
-	if (test_tsk_need_resched(curr))
+	if (test_tsk_need_resched_type(curr, nr_bit))
 		return;
 
 	cpu = cpu_of(rq);
 
 	if (cpu == smp_processor_id()) {
-		set_tsk_need_resched(curr);
-		set_preempt_need_resched();
+		set_tsk_thread_flag(curr, nr_bit);
+		if (nr_bit == TIF_NEED_RESCHED)
+			set_preempt_need_resched();
 		return;
 	}
 
-	if (set_nr_and_not_polling(curr))
-		smp_send_reschedule(cpu);
-	else
+	if (set_nr_and_not_polling(curr, nr_bit)) {
+		if (nr_bit == TIF_NEED_RESCHED)
+			smp_send_reschedule(cpu);
+	} else {
 		trace_sched_wake_idle_without_ipi(cpu);
+	}
+}
+
+void resched_curr(struct rq *rq)
+{
+	__resched_curr(rq, TIF_NEED_RESCHED);
+}
+
+void resched_curr_lazy(struct rq *rq)
+{
+	__resched_curr(rq, sched_feat(FORCE_NEED_RESCHED) ?
+		       TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY);
 }
 
 void resched_cpu(int cpu)
@@ -1132,7 +1146,7 @@ static void wake_up_idle_cpu(int cpu)
 	if (cpu == smp_processor_id())
 		return;
 
-	if (set_nr_and_not_polling(rq->idle))
+	if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
 		smp_send_reschedule(cpu);
 	else
 		trace_sched_wake_idle_without_ipi(cpu);
@@ -8872,7 +8886,6 @@ static void __init preempt_dynamic_init(
 		WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
 		return preempt_dynamic_mode == preempt_dynamic_##mode;		 \
 	}									 \
-	EXPORT_SYMBOL_GPL(preempt_model_##mode)
 
 PREEMPT_MODEL_ACCESSOR(none);
 PREEMPT_MODEL_ACCESSOR(voluntary);
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,11 @@ enum syscall_work_bit {
 
 #include <asm/thread_info.h>
 
+#ifndef CONFIG_PREEMPT_AUTO
+# define TIF_NEED_RESCHED_LAZY		TIF_NEED_RESCHED
+# define _TIF_NEED_RESCHED_LAZY		_TIF_NEED_RESCHED
+#endif
+
 #ifdef __KERNEL__
 
 #ifndef arch_set_restart_data
@@ -185,6 +190,13 @@ static __always_inline bool tif_need_res
 			     (unsigned long *)(&current_thread_info()->flags));
 }
 
+static __always_inline bool tif_need_resched_lazy(void)
+{
+	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+		arch_test_bit(TIF_NEED_RESCHED_LAZY,
+			      (unsigned long *)(&current_thread_info()->flags));
+}
+
 #else
 
 static __always_inline bool tif_need_resched(void)
@@ -193,6 +205,13 @@ static __always_inline bool tif_need_res
 			(unsigned long *)(&current_thread_info()->flags));
 }
 
+static __always_inline bool tif_need_resched_lazy(void)
+{
+	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+		test_bit(TIF_NEED_RESCHED_LAZY,
+			 (unsigned long *)(&current_thread_info()->flags));
+}
+
 #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
 
 #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -11,6 +11,9 @@ config PREEMPT_BUILD
 	select PREEMPTION
 	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
 
+config HAVE_PREEMPT_AUTO
+	bool
+
 choice
 	prompt "Preemption Model"
 	default PREEMPT_NONE
@@ -67,6 +70,13 @@ config PREEMPT
 	  embedded system with latency requirements in the milliseconds
 	  range.
 
+config PREEMPT_AUTO
+	bool "Automagic preemption mode with runtime tweaking support"
+	depends on HAVE_PREEMPT_AUTO
+	select PREEMPT_BUILD
+	help
+	  Add some sensible blurb here
+
 config PREEMPT_RT
 	bool "Fully Preemptible Kernel (Real-Time)"
 	depends on EXPERT && ARCH_SUPPORTS_RT
@@ -95,7 +105,7 @@ config PREEMPTION
 
 config PREEMPT_DYNAMIC
 	bool "Preemption behaviour defined on boot"
-	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
+	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
 	select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
 	select PREEMPT_BUILD
 	default y if HAVE_PREEMPT_DYNAMIC_CALL
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -60,7 +60,7 @@
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
 	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
-	 ARCH_EXIT_TO_USER_MODE_WORK)
+	 _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
  * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@
 
 #define XFER_TO_GUEST_MODE_WORK						\
 	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
-	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+	 _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
 
 struct kvm_vcpu;
 
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & _TIF_NEED_RESCHED)
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
 			schedule();
 
 		if (ti_work & _TIF_UPROBE)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
 SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(HZ_BW, true)
+
+SCHED_FEAT(FORCE_NEED_RESCHED, false)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
 extern void reweight_task(struct task_struct *p, int prio);
 
 extern void resched_curr(struct rq *rq);
+extern void resched_curr_lazy(struct rq *rq);
 extern void resched_cpu(int cpu);
 
 extern struct rt_bandwidth def_rt_bandwidth;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
 	update_ti_thread_flag(task_thread_info(tsk), flag, value);
 }
 
-static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
 {
 	return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
 }
 
-static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
 {
 	return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
 }
 
-static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
 {
 	return test_ti_thread_flag(task_thread_info(tsk), flag);
 }
@@ -2069,13 +2069,21 @@ static inline void set_tsk_need_resched(
 static inline void clear_tsk_need_resched(struct task_struct *tsk)
 {
 	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+	if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
+		clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
 }
 
-static inline int test_tsk_need_resched(struct task_struct *tsk)
+static inline bool test_tsk_need_resched(struct task_struct *tsk)
 {
 	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
 }
 
+static inline bool test_tsk_need_resched_type(struct task_struct *tsk,
+					      int nr_bit)
+{
+	return unlikely(test_tsk_thread_flag(tsk, nr_bit));
+}
+
 /*
  * cond_resched() and cond_resched_lock(): latency reduction via
  * explicit rescheduling in places that are safe. The return
@@ -2252,7 +2260,7 @@ static inline int rwlock_needbreak(rwloc
 
 static __always_inline bool need_resched(void)
 {
-	return unlikely(tif_need_resched());
+	return unlikely(tif_need_resched_lazy() || tif_need_resched());
 }
 
 /*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -985,7 +985,7 @@ static void update_deadline(struct cfs_r
 	 * The task has consumed its request, reschedule.
 	 */
 	if (cfs_rq->nr_running > 1) {
-		resched_curr(rq_of(cfs_rq));
+		resched_curr_lazy(rq_of(cfs_rq));
 		clear_buddies(cfs_rq, se);
 	}
 }
@@ -5267,7 +5267,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 	 * validating it and just reschedule.
 	 */
 	if (queued) {
-		resched_curr(rq_of(cfs_rq));
+		resched_curr_lazy(rq_of(cfs_rq));
 		return;
 	}
 	/*
@@ -5413,7 +5413,7 @@ static void __account_cfs_rq_runtime(str
 	 * hierarchy can be throttled
 	 */
 	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
-		resched_curr(rq_of(cfs_rq));
+		resched_curr_lazy(rq_of(cfs_rq));
 }
 
 static __always_inline
@@ -5673,7 +5673,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
 
 	/* Determine whether we need to wake up potentially idle CPU: */
 	if (rq->curr == rq->idle && rq->cfs.nr_running)
-		resched_curr(rq);
+		resched_curr_lazy(rq);
 }
 
 #ifdef CONFIG_SMP
@@ -8073,7 +8073,7 @@ static void check_preempt_wakeup(struct
 	return;
 
 preempt:
-	resched_curr(rq);
+	resched_curr_lazy(rq);
 }
 
 #ifdef CONFIG_SMP
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -108,7 +108,7 @@ static const struct dmi_system_id proces
  */
 static void __cpuidle acpi_safe_halt(void)
 {
-	if (!tif_need_resched()) {
+	if (!need_resched()) {
 		raw_safe_halt();
 		raw_local_irq_disable();
 	}
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -63,7 +63,7 @@ static __always_inline bool __must_check
 	 */
 	smp_mb__after_atomic();
 
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 
 static __always_inline bool __must_check current_clr_polling_and_test(void)
@@ -76,7 +76,7 @@ static __always_inline bool __must_check
 	 */
 	smp_mb__after_atomic();
 
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 
 #else
@@ -85,11 +85,11 @@ static inline void __current_clr_polling
 
 static inline bool __must_check current_set_polling_and_test(void)
 {
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 static inline bool __must_check current_clr_polling_and_test(void)
 {
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 #endif
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
 	ct_cpuidle_enter();
 
 	raw_local_irq_enable();
-	while (!tif_need_resched() &&
-	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
+	while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
 		cpu_relax();
 	raw_local_irq_disable();
 
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2720,7 +2720,7 @@ unsigned int tracing_gen_ctx_irq_test(un
 	if (softirq_count() >> (SOFTIRQ_SHIFT + 1))
 		trace_flags |= TRACE_FLAG_BH_OFF;
 
-	if (tif_need_resched())
+	if (need_resched())
 		trace_flags |= TRACE_FLAG_NEED_RESCHED;
 	if (test_preempt_need_resched())
 		trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -271,6 +271,7 @@ config X86
 	select HAVE_STATIC_CALL
 	select HAVE_STATIC_CALL_INLINE		if HAVE_OBJTOOL
 	select HAVE_PREEMPT_DYNAMIC_CALL
+	select HAVE_PREEMPT_AUTO
 	select HAVE_RSEQ
 	select HAVE_RUST			if X86_64
 	select HAVE_SYSCALL_TRACEPOINTS
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
 #define TIF_SSBD		5	/* Speculative store bypass disable */
+#define TIF_NEED_RESCHED_LAZY	6	/* Lazy rescheduling */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
 #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
@@ -106,6 +107,7 @@ struct thread_info {
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_SSBD		(1 << TIF_SSBD)
+#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
 #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
 #define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
Thomas Gleixner Sept. 21, 2023, 12:14 a.m. UTC | #78
On Wed, Sep 20 2023 at 22:51, Thomas Gleixner wrote:
> On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote:
>
> The preempt count folding is an optimization which simplifies the
> preempt_enable logic:
>
> 	if (--preempt_count && need_resched())
> 		schedule()
> to
> 	if (--preempt_count)
> 		schedule()

That should be (!(--preempt_count... in both cases of course :)
Ankur Arora Sept. 21, 2023, 12:57 a.m. UTC | #79
Thomas Gleixner <tglx@linutronix.de> writes:

> On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
>> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
>>> Anyway, I'm definitely not opposed. We'd get rid of a config option
>>> that is presumably not very widely used, and we'd simplify a lot of
>>> issues, and get rid of all these badly defined "cond_preempt()"
>>> things.
>>
>> Hmm. Didn't I promise a year ago that I won't do further large scale
>> cleanups and simplifications beyond printk.
>>
>> Maybe I get away this time with just suggesting it. :)
>
> Maybe not. As I'm inveterately curious, I sat down and figured out what
> that might look like.
>
> To some extent I really curse my curiosity as the amount of macro maze,
> config options and convoluted mess behind all these preempt mechanisms
> is beyond disgusting.
>
> Find below a PoC which implements that scheme. It's not even close to
> correct, but it builds, boots and survives lightweight testing.

Whew, that was electric. I had barely managed to sort through some of
the config maze.
From a quick look this is pretty much how you described it.

> I did not even try to look into time-slice enforcement, but I really want
> to share this for illustration and for others to experiment.
>
> This keeps all the existing mechanisms in place and introduces a new
> config knob in the preemption model Kconfig switch: PREEMPT_AUTO
>
> If selected it builds a CONFIG_PREEMPT kernel, which disables the
> cond_resched() machinery and switches the fair scheduler class to use
> the NEED_RESCHED_LAZY bit by default, i.e. it should be pretty close to
> the preempt NONE model except that cond_resched() is a NOOP and I did
> not validate the time-slice enforcement. The latter should be a
> no-brainer to figure out and fix if required.

Yeah, let me try this out.

Thanks
Ankur
Ankur Arora Sept. 21, 2023, 12:58 a.m. UTC | #80
Thomas Gleixner <tglx@linutronix.de> writes:

> On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote:
>> Thomas Gleixner <tglx@linutronix.de> writes:
>>
>>> So the decision matrix would be:
>>>
>>>                 Ret2user        Ret2kernel      PreemptCnt=0
>>>
>>> NEED_RESCHED       Y                Y               Y
>>> LAZY_RESCHED       Y                N               N
>>>
>>> That is completely independent of the preemption model and the
>>> differentiation of the preemption models happens solely at the scheduler
>>> level:
>>
>> This is relatively minor, but do we need two flags? Seems to me we
>> can get to the same decision matrix by letting the scheduler fold
>> into the preempt-count based on current preemption model.
>
> You still need the TIF flags because there is no way to do remote
> modification of preempt count.

Yes, agreed. In my version, I was envisaging that the remote cpu always
only sets up TIF_NEED_RESCHED and then we decide which one we want at
the preemption point.

Anyway, I see what you meant in your PoC.

>>> But they support PREEMPT_COUNT, so we might get away with a reduced
>>> preemption point coverage:
>>>
>>>                 Ret2user        Ret2kernel      PreemptCnt=0
>>>
>>> NEED_RESCHED       Y                N               Y
>>> LAZY_RESCHED       Y                N               N
>>
>> So from the discussion in the other thread, for the ARCH_NO_PREEMPT
>> configs that don't support preemption, we probably need a fourth
>> preemption model, say PREEMPT_UNSAFE.
>
> As discussed they wont really notice the latency issues because the
> museum pieces are not used for anything crucial and for UM that's the
> least of the correctness worries.
>
> So no, we don't need yet another knob. We keep them chucking along and
> if they really want they can adapt to the new world order. :)

Will they chuckle along, or die trying ;)?

I grepped for "preempt_enable|preempt_disable" for all the archs and
hexagon and m68k don't seem to do any explicit accounting at all.
(Though, neither do nios2 and openrisc, and both csky and microblaze
only do it in the tlbflush path.)

        arch/hexagon      0
        arch/m68k         0
        arch/nios2        0
        arch/openrisc     0
        arch/csky         3
        arch/microblaze   3
        arch/um           4
        arch/riscv        8
        arch/arc         14
        arch/parisc      15
        arch/arm         16
        arch/sparc       16
        arch/xtensa      19
        arch/sh          21
        arch/alpha       23
        arch/ia64        27
        arch/loongarch   53
        arch/arm64       54
        arch/s390        91
        arch/mips       115
        arch/x86        146
        arch/powerpc    201

My concern is that, given we preempt on timeslice expiration for all
three preemption models, we could end up preempting at an unsafe
location.

Still, not the most pressing of problems.


Thanks
--
ankur
Thomas Gleixner Sept. 21, 2023, 2:02 a.m. UTC | #81
On Wed, Sep 20 2023 at 17:57, Ankur Arora wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
>> Find below a PoC which implements that scheme. It's not even close to
>> correct, but it builds, boots and survives lightweight testing.
>
> Whew, that was electric. I had barely managed to sort through some of
> the config maze.
> From a quick look this is pretty much how you described it.

Unsurprisingly I spent at least 10x the time to describe it than to hack
it up.

IOW, I had done the analysis before I offered the idea and before I
changed a single line of code. The tools I used for that are git-grep,
tags, paper, pencil, accrued knowledge and patience, i.e. nothing even
close to rocket science.

Converting the analysis into code was mostly a matter of brain dumping
the analysis and adherence to accrued methodology.

What's electric about that?

I might be missing some meaning of 'electric' which is not covered by my
mostly Webster restricted old-school understanding of the english language :)

>> I did not even try to look into time-slice enforcement, but I really want
>> to share this for illustration and for others to experiment.
>>
>> This keeps all the existing mechanisms in place and introduces a new
>> config knob in the preemption model Kconfig switch: PREEMPT_AUTO
>>
>> If selected it builds a CONFIG_PREEMPT kernel, which disables the
>> cond_resched() machinery and switches the fair scheduler class to use
>> the NEED_PREEMPT_LAZY bit by default, i.e. it should be pretty close to
>> the preempt NONE model except that cond_resched() is a NOOP and I did
>> not validate the time-slice enforcement. The latter should be a
>> no-brainer to figure out and fix if required.
>
> Yeah, let me try this out.

That's what I hoped for :)

Thanks,

        tglx
Thomas Gleixner Sept. 21, 2023, 2:12 a.m. UTC | #82
On Wed, Sep 20 2023 at 17:58, Ankur Arora wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
>> So no, we don't need yet another knob. We keep them chucking along and
>> if they really want they can adopt to the new world order. :)
>
> Will they chuckle along, or die trying ;)?

Either way is fine :)

> I grepped for "preempt_enable|preempt_disable" for all the archs and
> hexagon and m68k don't seem to do any explicit accounting at all.
> (Though, neither do nios2 and openrisc, and both csky and microblaze
> only do it in the tlbflush path.)
>
>         arch/hexagon      0
>         arch/m68k         0
...
>         arch/s390        91
>         arch/mips       115
>         arch/x86        146
>         arch/powerpc    201
>
> My concern is given that we preempt on timeslice expiration for all
> three preemption models, we could end up preempting at an unsafe
> location.

As I said in my reply to Linus, that count is not really conclusive.

arch/m68k has a count of 0 and supports PREEMPT for the COLDFIRE
sub-architecture and I know for sure that at some point in the past
PREEMPT_RT was supported on COLDFIRE with minimal changes to the
architecture code.

That said, I'm pretty sure that quite a few of these
preempt_disable/enable pairs in arch/* are subject to voodoo
programming, but that's a different problem to analyze.

> Still, not the most pressing of problems.

Exactly :)

Thanks,

        tglx
Ankur Arora Sept. 21, 2023, 4:16 a.m. UTC | #83
Thomas Gleixner <tglx@linutronix.de> writes:

> On Wed, Sep 20 2023 at 17:57, Ankur Arora wrote:
>> Thomas Gleixner <tglx@linutronix.de> writes:
>>> Find below a PoC which implements that scheme. It's not even close to
>>> correct, but it builds, boots and survives lightweight testing.
>>
>> Whew, that was electric. I had barely managed to sort through some of
>> the config maze.
>> From a quick look this is pretty much how you described it.
>
> Unsurprisingly I spent at least 10x the time to describe it than to hack
> it up.
>
> IOW, I had done the analysis before I offered the idea and before I
> changed a single line of code. The tools I used for that are git-grep,
> tags, paper, pencil, accrued knowledge and patience, i.e. nothing even
> close to rocket science.
>
> Converting the analysis into code was mostly a matter of brain dumping
> the analysis and adherence to accrued methodology.
>
> What's electric about that?

Hmm, so I /think/ I was going for something like electric current taking
the optimal path, with a side meaning of electrifying.

Though, I guess electron flow is quantum mechanical, so it would really
try all possible paths, which means the analogy doesn't quite fit.

Let me substitute greased lightning for electric :D.

> I might be missing some meaning of 'electric' which is not covered by my
> mostly Webster restricted old-school understanding of the english language :)
>
>>> I did not even try to look into time-slice enforcement, but I really want
>>> to share this for illustration and for others to experiment.
>>>
>>> This keeps all the existing mechanisms in place and introduces a new
>>> config knob in the preemption model Kconfig switch: PREEMPT_AUTO
>>>
>>> If selected it builds a CONFIG_PREEMPT kernel, which disables the
>>> cond_resched() machinery and switches the fair scheduler class to use
>>> the NEED_PREEMPT_LAZY bit by default, i.e. it should be pretty close to
>>> the preempt NONE model except that cond_resched() is a NOOP and I did
>>> not validate the time-slice enforcement. The latter should be a
>>> no-brainer to figure out and fix if required.
>>
>> Yeah, let me try this out.
>
> That's what I hoped for :)

:).

Quick update: it hasn't eaten my hard disk yet. Both the "none" and
"full" variants are stable with kbuild.

Next: time-slice validation, any fixes and then maybe alarm-clock
deployments.

And then, if you are okay with it, I'll clean up and structure your
patches, together with all the other preemption cleanups we discussed,
into an RFC series.

(One thing I can't wait to measure is how many cond_resched() calls and
associated dynamic instructions we no longer execute.
Not because I think it really matters for performance -- though it might
on low-IPC archs -- but because it'll be a relief to see cond_resched()
gone for real.)

--
ankur
Arnd Bergmann Sept. 21, 2023, 12:20 p.m. UTC | #84
On Tue, Sep 19, 2023, at 10:16, Peter Zijlstra wrote:
> On Tue, Sep 19, 2023 at 03:48:09PM +0200, John Paul Adrian Glaubitz wrote:
>> On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote:
>> > > The agreement to kill off ia64 wasn't an invitation to kill off other stuff
>> > > that people are still working on! Can we please not do this?
>> > 
>> > If you're working on one of them, then surely it's a simple matter of
>> > working on adding CONFIG_PREEMPT support :-)
>> 
>> As Geert poined out, I'm not seeing anything particular problematic with the
>> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more
>> something about organizing KConfig files.
>
> The plan in the parent thread is to remove PREEMPT_NONE and
> PREEMPT_VOLUNTARY and only keep PREEMPT_FULL.
...
>
> PREEMPT isn't something new. Also, I don't think the arch part for
> actually supporting it is particularly hard, mostly it is sticking the
> preempt_schedule_irq() call in return from interrupt code path.
>
> If you convert the arch to generic-entry (a much larger undertaking)
> then you get this for free.

I checked the default configurations for both in-kernel targets and
general-purpose distros and was surprised to learn that very few
actually turn on full preemption by default:

- All distros I looked at (rhel, debian, opensuse) use PREEMPT_VOLUNTARY
  by default, though they usually also set PREEMPT_DYNAMIC to let users
  override it at boot time.

- The majority (220) of all defconfig files in the kernel don't select
  any preemption options, and just get PREEMPT_NONE automatically.
  This includes the generic configs for armv7, s390 and mips.
  
- A small number (24) set PREEMPT_VOLUNTARY, but this notably
  includes x86 and ppc64. x86 is the only one of those that sets
  PREEMPT_DYNAMIC.

- CONFIG_PREEMPT=y (full preemption) is used on 89 defconfigs,
  including arm64 and a lot of the older arm32, arc and
  mips platforms.

If we want to have a chance of removing both PREEMPT_NONE and
PREEMPT_VOLUNTARY, I think we should start with changing the
defaults first, so defconfigs that don't specify anything else
get PREEMPT=y, and distros that use PREEMPT_VOLUNTARY use it only in
the absence of a command line argument. If that doesn't cause too many
regressions, the next step might be to hide the choice under
CONFIG_EXPERT until m68k and alpha no longer require PREEMPT_NONE.

     Arnd
Steven Rostedt Sept. 21, 2023, 1:59 p.m. UTC | #85
On Wed, 20 Sep 2023 21:16:17 -0700
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> > On Wed, Sep 20 2023 at 17:57, Ankur Arora wrote:  
> >> Thomas Gleixner <tglx@linutronix.de> writes:  
> >>> Find below a PoC which implements that scheme. It's not even close to
> >>> correct, but it builds, boots and survives lightweight testing.  
> >>
> >> Whew, that was electric. I had barely managed to sort through some of
> >> the config maze.
> >> From a quick look this is pretty much how you described it.  
> >


> > What's electric about that?  
> 
> Hmm, so I /think/ I was going for something like electric current taking
> the optimal path, with a side meaning of electrifying.
> 
> Though, I guess electron flow is quantum mechanical, so that would
> really try all possible paths, which means the analogy doesn't quite
> fit.
> 
> Let me substitute greased lightning for electric :D.

"It's electrifying!" ;-)

  https://www.youtube.com/watch?v=7oKPYe53h78

-- Steve
Linus Torvalds Sept. 21, 2023, 4 p.m. UTC | #86
Ok, I like this.

That said, this part of it:

On Wed, 20 Sept 2023 at 16:58, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> -void resched_curr(struct rq *rq)
> +static void __resched_curr(struct rq *rq, int nr_bit)
>  [...]
> -               set_tsk_need_resched(curr);
> -               set_preempt_need_resched();
> +               set_tsk_thread_flag(curr, nr_bit);
> +               if (nr_bit == TIF_NEED_RESCHED)
> +                       set_preempt_need_resched();

feels really hacky.

I think that instead of passing a random TIF bit around, it should
just pass a "lazy or not" value around.

Then you make the TIF bit be some easily computable thing (eg something like

        #define TIF_RESCHED(lazy) (TIF_NEED_RESCHED + (lazy))

or whatever), and write the above conditional as

        if (!lazy)
                set_preempt_need_resched();

so that it all *does* the same thing, but the code makes it clear
about what the logic is.
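
For illustration only, a minimal sketch of the local-CPU path written
that way -- TIF_RESCHED() being the hypothetical helper above, the rest
mirroring the PoC; not a tested change:

        static void __resched_curr(struct rq *rq, bool lazy)
        {
                struct task_struct *curr = rq->curr;

                lockdep_assert_rq_held(rq);

                if (test_tsk_thread_flag(curr, TIF_RESCHED(lazy)))
                        return;

                if (cpu_of(rq) == smp_processor_id()) {
                        set_tsk_thread_flag(curr, TIF_RESCHED(lazy));
                        if (!lazy)
                                set_preempt_need_resched();
                        return;
                }
                /* remote case as in the PoC: set the bit, IPI only if !lazy */
        }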

Because honestly, without having been part of this thread, I would look at that

        if (nr_bit == TIF_NEED_RESCHED)
                set_preempt_need_resched();

and I'd be completely lost. It doesn't make conceptual sense, I feel.

So I'd really like the source code to be more directly expressing the
*intent* of the code, not be so centered around the implementation
detail.

Put another way: I think we can make the compiler turn the intent into
the implementation, and I'd rather *not* have us humans have to infer
the intent from the implementation.

That said - I think as a proof of concept and "look, with this we get
the expected scheduling event counts", that patch is perfect. I think
you more than proved the concept.

                 Linus
Thomas Gleixner Sept. 21, 2023, 10:55 p.m. UTC | #87
Linus!

On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
> Ok, I like this.

Thanks!

> That said, this part of it:

> On Wed, 20 Sept 2023 at 16:58, Thomas Gleixner <tglx@linutronix.de> wrote:

> Because honestly, without having been part of this thread, I would look at that
>
>         if (nr_bit == TIF_NEED_RESCHED)
>                 set_preempt_need_resched();
>
> and I'd be completely lost. It doesn't make conceptual sense, I feel.
>
> So I'd really like the source code to be more directly expressing the
> *intent* of the code, not be so centered around the implementation
> detail.
>
> Put another way: I think we can make the compiler turn the intent into
> the implementation, and I'd rather *not* have us humans have to infer
> the intent from the implementation.

No argument about that. I didn't like it either, but at 10PM ...

> That said - I think as a proof of concept and "look, with this we get
> the expected scheduling event counts", that patch is perfect. I think
> you more than proved the concept.

There is certainly quite some analysis work to do to make this a one to
one replacement.

With a handful of benchmarks the PoC (tweaked with some obvious fixes)
is pretty much on par with the current mainline variants (NONE/FULL),
but the memtier benchmark makes a massive dent.

It sports a whopping 10% regression with the LAZY mode versus the mainline
NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.

That benchmark is really sensitive to the preemption model. With current
mainline (PREEMPT_DYNAMIC enabled) the preempt=FULL model has ~20%
performance drop versus preempt=NONE.

I have no clue what's going on there yet, but that shows that there is
obviously quite some work ahead to get this sorted.

Though I'm pretty convinced by now that this is the right direction and
well worth the effort which needs to be put into that.

Thanks,

        tglx
Thomas Gleixner Sept. 23, 2023, 1:11 a.m. UTC | #88
On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
> On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
>> That said - I think as a proof of concept and "look, with this we get
>> the expected scheduling event counts", that patch is perfect. I think
>> you more than proved the concept.
>
> There is certainly quite some analysis work to do to make this a one to
> one replacement.
>
> With a handful of benchmarks the PoC (tweaked with some obvious fixes)
> is pretty much on par with the current mainline variants (NONE/FULL),
> but the memtier benchmark makes a massive dent.
>
> It sports a whopping 10% regression with the LAZY mode versus the mainline
> NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
>
> That benchmark is really sensitive to the preemption model. With current
> mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20%
> performance drop versus preempt=NONE.

That 20% was a tired pilot error. The real number is in the 5% ballpark.

> I have no clue what's going on there yet, but that shows that there is
> obviously quite some work ahead to get this sorted.

It took some head scratching to figure that out. The initial fix broke
the handling of the hog issue, i.e. the problem that Ankur tried to
solve, but I hacked up a "solution" for that too.

With that the memtier benchmark is roughly back to the mainline numbers,
but my throughput benchmark know-how is pretty close to zero, so that
should be looked at by people who actually understand these things.

Likewise the hog prevention is just at the PoC level and clearly beyond
my knowledge of scheduler details: It unconditionally forces a
reschedule when the looping task is not responding to a lazy reschedule
request before the next tick. IOW it forces a reschedule on the second
tick, which is obviously different from the cond_resched()/might_sleep()
behaviour.

The changes vs. the original PoC aside of the bug and thinko fixes:

    1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
       lazy preempt bit as the trace_entry::flags field is full already.

       That obviously breaks the tracer ABI, but if we go there then
       this needs to be fixed. Steven?

    2) debugfs file to validate that loops can be force preempted w/o
       cond_resched()

       The usage is:

       # taskset -c 1 bash
       # echo 1 > /sys/kernel/debug/sched/hog &
       # echo 1 > /sys/kernel/debug/sched/hog &
       # echo 1 > /sys/kernel/debug/sched/hog &

       top shows ~33% CPU for each of the hogs and tracing confirms that
       the crude hack in the scheduler tick works:

            bash-4559    [001] dlh2.  2253.331202: resched_curr <-__update_curr
            bash-4560    [001] dlh2.  2253.340199: resched_curr <-__update_curr
            bash-4561    [001] dlh2.  2253.346199: resched_curr <-__update_curr
            bash-4559    [001] dlh2.  2253.353199: resched_curr <-__update_curr
            bash-4561    [001] dlh2.  2253.358199: resched_curr <-__update_curr
            bash-4560    [001] dlh2.  2253.370202: resched_curr <-__update_curr
            bash-4559    [001] dlh2.  2253.378198: resched_curr <-__update_curr
            bash-4561    [001] dlh2.  2253.389199: resched_curr <-__update_curr

       The 'l' instead of the usual 'N' reflects that the lazy resched
       bit is set. That makes __update_curr() invoke resched_curr()
       instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
       and folds it into preempt_count so that preemption happens at the
       next possible point, i.e. either in return from interrupt or at
       the next preempt_enable().

That's as much as I wanted to demonstrate and I'm not going to spend
more cycles on it as I already have too many other things in flight and
the resulting scheduler woes are clearly outside of my expertise.

Though definitely I'm putting a permanent NAK in place for any attempts
to duct tape the preempt=NONE model any further by sprinkling more
cond*() and whatever warts around.

Thanks,

        tglx
---
 arch/x86/Kconfig                   |    1 
 arch/x86/include/asm/thread_info.h |    6 ++--
 drivers/acpi/processor_idle.c      |    2 -
 include/linux/entry-common.h       |    2 -
 include/linux/entry-kvm.h          |    2 -
 include/linux/sched.h              |   12 +++++---
 include/linux/sched/idle.h         |    8 ++---
 include/linux/thread_info.h        |   24 +++++++++++++++++
 include/linux/trace_events.h       |    8 ++---
 kernel/Kconfig.preempt             |   17 +++++++++++-
 kernel/entry/common.c              |    4 +-
 kernel/entry/kvm.c                 |    2 -
 kernel/sched/core.c                |   51 +++++++++++++++++++++++++------------
 kernel/sched/debug.c               |   19 +++++++++++++
 kernel/sched/fair.c                |   46 ++++++++++++++++++++++-----------
 kernel/sched/features.h            |    2 +
 kernel/sched/idle.c                |    3 --
 kernel/sched/sched.h               |    1 
 kernel/trace/trace.c               |    2 +
 kernel/trace/trace_output.c        |   16 ++++++++++-
 20 files changed, 171 insertions(+), 57 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct
 
 #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
 /*
- * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
+ * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
  * this avoids any races wrt polling state changes and thereby avoids
  * spurious IPIs.
  */
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
 {
 	struct thread_info *ti = task_thread_info(p);
-	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+
+	return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG);
 }
 
 /*
@@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas
 	for (;;) {
 		if (!(val & _TIF_POLLING_NRFLAG))
 			return false;
-		if (val & _TIF_NEED_RESCHED)
+		if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
 			return true;
 		if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
 			break;
@@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas
 }
 
 #else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
 {
-	set_tsk_need_resched(p);
+	set_tsk_thread_flag(p, tif_bit);
 	return true;
 }
 
@@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head)
  * might also involve a cross-CPU call to trigger the scheduler on
  * the target CPU.
  */
-void resched_curr(struct rq *rq)
+static void __resched_curr(struct rq *rq, int lazy)
 {
+	int cpu, tif_bit = TIF_NEED_RESCHED + lazy;
 	struct task_struct *curr = rq->curr;
-	int cpu;
 
 	lockdep_assert_rq_held(rq);
 
-	if (test_tsk_need_resched(curr))
+	if (unlikely(test_tsk_thread_flag(curr, tif_bit)))
 		return;
 
 	cpu = cpu_of(rq);
 
 	if (cpu == smp_processor_id()) {
-		set_tsk_need_resched(curr);
-		set_preempt_need_resched();
+		set_tsk_thread_flag(curr, tif_bit);
+		if (!lazy)
+			set_preempt_need_resched();
 		return;
 	}
 
-	if (set_nr_and_not_polling(curr))
-		smp_send_reschedule(cpu);
-	else
+	if (set_nr_and_not_polling(curr, tif_bit)) {
+		if (!lazy)
+			smp_send_reschedule(cpu);
+	} else {
 		trace_sched_wake_idle_without_ipi(cpu);
+	}
+}
+
+void resched_curr(struct rq *rq)
+{
+	__resched_curr(rq, 0);
+}
+
+void resched_curr_lazy(struct rq *rq)
+{
+	int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ?
+		TIF_NEED_RESCHED_LAZY_OFFSET : 0;
+
+	if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED)))
+		return;
+
+	__resched_curr(rq, lazy);
 }
 
 void resched_cpu(int cpu)
@@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu)
 	if (cpu == smp_processor_id())
 		return;
 
-	if (set_nr_and_not_polling(rq->idle))
+	if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
 		smp_send_reschedule(cpu);
 	else
 		trace_sched_wake_idle_without_ipi(cpu);
@@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init(
 		WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
 		return preempt_dynamic_mode == preempt_dynamic_##mode;		 \
 	}									 \
-	EXPORT_SYMBOL_GPL(preempt_model_##mode)
 
 PREEMPT_MODEL_ACCESSOR(none);
 PREEMPT_MODEL_ACCESSOR(voluntary);
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,16 @@ enum syscall_work_bit {
 
 #include <asm/thread_info.h>
 
+#ifdef CONFIG_PREEMPT_AUTO
+# define TIF_NEED_RESCHED_LAZY		TIF_ARCH_RESCHED_LAZY
+# define _TIF_NEED_RESCHED_LAZY		_TIF_ARCH_RESCHED_LAZY
+# define TIF_NEED_RESCHED_LAZY_OFFSET	(TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
+#else
+# define TIF_NEED_RESCHED_LAZY		TIF_NEED_RESCHED
+# define _TIF_NEED_RESCHED_LAZY		_TIF_NEED_RESCHED
+# define TIF_NEED_RESCHED_LAZY_OFFSET	0
+#endif
+
 #ifdef __KERNEL__
 
 #ifndef arch_set_restart_data
@@ -185,6 +195,13 @@ static __always_inline bool tif_need_res
 			     (unsigned long *)(&current_thread_info()->flags));
 }
 
+static __always_inline bool tif_need_resched_lazy(void)
+{
+	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+		arch_test_bit(TIF_NEED_RESCHED_LAZY,
+			      (unsigned long *)(&current_thread_info()->flags));
+}
+
 #else
 
 static __always_inline bool tif_need_resched(void)
@@ -193,6 +210,13 @@ static __always_inline bool tif_need_res
 			(unsigned long *)(&current_thread_info()->flags));
 }
 
+static __always_inline bool tif_need_resched_lazy(void)
+{
+	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
+		test_bit(TIF_NEED_RESCHED_LAZY,
+			 (unsigned long *)(&current_thread_info()->flags));
+}
+
 #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
 
 #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -11,6 +11,13 @@ config PREEMPT_BUILD
 	select PREEMPTION
 	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
 
+config PREEMPT_BUILD_AUTO
+	bool
+	select PREEMPT_BUILD
+
+config HAVE_PREEMPT_AUTO
+	bool
+
 choice
 	prompt "Preemption Model"
 	default PREEMPT_NONE
@@ -67,9 +74,17 @@ config PREEMPT
 	  embedded system with latency requirements in the milliseconds
 	  range.
 
+config PREEMPT_AUTO
+	bool "Automagic preemption mode with runtime tweaking support"
+	depends on HAVE_PREEMPT_AUTO
+	select PREEMPT_BUILD_AUTO
+	help
+	  Add some sensible blurb here
+
 config PREEMPT_RT
 	bool "Fully Preemptible Kernel (Real-Time)"
 	depends on EXPERT && ARCH_SUPPORTS_RT
+	select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
 	select PREEMPTION
 	help
 	  This option turns the kernel into a real-time kernel by replacing
@@ -95,7 +110,7 @@ config PREEMPTION
 
 config PREEMPT_DYNAMIC
 	bool "Preemption behaviour defined on boot"
-	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
+	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
 	select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
 	select PREEMPT_BUILD
 	default y if HAVE_PREEMPT_DYNAMIC_CALL
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -60,7 +60,7 @@
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
 	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
-	 ARCH_EXIT_TO_USER_MODE_WORK)
+	 _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
  * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@
 
 #define XFER_TO_GUEST_MODE_WORK						\
 	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
-	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+	 _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
 
 struct kvm_vcpu;
 
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & _TIF_NEED_RESCHED)
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
 			schedule();
 
 		if (ti_work & _TIF_UPROBE)
@@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void
 		rcu_irq_exit_check_preempt();
 		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
 			WARN_ON_ONCE(!on_thread_stack());
-		if (need_resched())
+		if (test_tsk_need_resched(current))
 			preempt_schedule_irq();
 	}
 }
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
 SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(HZ_BW, true)
+
+SCHED_FEAT(FORCE_NEED_RESCHED, false)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
 extern void reweight_task(struct task_struct *p, int prio);
 
 extern void resched_curr(struct rq *rq);
+extern void resched_curr_lazy(struct rq *rq);
 extern void resched_cpu(int cpu);
 
 extern struct rt_bandwidth def_rt_bandwidth;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
 	update_ti_thread_flag(task_thread_info(tsk), flag, value);
 }
 
-static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
 {
 	return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
 }
 
-static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
 {
 	return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
 }
 
-static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
 {
 	return test_ti_thread_flag(task_thread_info(tsk), flag);
 }
@@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched(
 static inline void clear_tsk_need_resched(struct task_struct *tsk)
 {
 	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+	if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
+		clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
 }
 
-static inline int test_tsk_need_resched(struct task_struct *tsk)
+static inline bool test_tsk_need_resched(struct task_struct *tsk)
 {
 	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
 }
@@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc
 
 static __always_inline bool need_resched(void)
 {
-	return unlikely(tif_need_resched());
+	return unlikely(tif_need_resched_lazy() || tif_need_resched());
 }
 
 /*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq
  * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
  * this is probably good enough.
  */
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick)
 {
+	struct rq *rq = rq_of(cfs_rq);
+
 	if ((s64)(se->vruntime - se->deadline) < 0)
 		return;
 
@@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r
 	/*
 	 * The task has consumed its request, reschedule.
 	 */
-	if (cfs_rq->nr_running > 1) {
-		resched_curr(rq_of(cfs_rq));
-		clear_buddies(cfs_rq, se);
+	if (cfs_rq->nr_running < 2)
+		return;
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
+		resched_curr(rq);
+	} else {
+		/* Did the task ignore the lazy reschedule request? */
+		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
+			resched_curr(rq);
+		else
+			resched_curr_lazy(rq);
 	}
+	clear_buddies(cfs_rq, se);
 }
 
 #include "pelt.h"
@@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf
 /*
  * Update the current task's runtime statistics.
  */
-static void update_curr(struct cfs_rq *cfs_rq)
+static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
 {
 	struct sched_entity *curr = cfs_rq->curr;
 	u64 now = rq_clock_task(rq_of(cfs_rq));
@@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
-	update_deadline(cfs_rq, curr);
+	update_deadline(cfs_rq, curr, tick);
 	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
@@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+static inline void update_curr(struct cfs_rq *cfs_rq)
+{
+	__update_curr(cfs_rq, false);
+}
+
 static void update_curr_fair(struct rq *rq)
 {
 	update_curr(cfs_rq_of(&rq->curr->se));
@@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 	/*
 	 * Update run-time statistics of the 'current'.
 	 */
-	update_curr(cfs_rq);
+	__update_curr(cfs_rq, true);
 
 	/*
 	 * Ensure that runnable average is periodically updated.
@@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 	 * validating it and just reschedule.
 	 */
 	if (queued) {
-		resched_curr(rq_of(cfs_rq));
+		resched_curr_lazy(rq_of(cfs_rq));
 		return;
 	}
 	/*
@@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str
 	 * hierarchy can be throttled
 	 */
 	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
-		resched_curr(rq_of(cfs_rq));
+		resched_curr_lazy(rq_of(cfs_rq));
 }
 
 static __always_inline
@@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
 
 	/* Determine whether we need to wake up potentially idle CPU: */
 	if (rq->curr == rq->idle && rq->cfs.nr_running)
-		resched_curr(rq);
+		resched_curr_lazy(rq);
 }
 
 #ifdef CONFIG_SMP
@@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq
 
 		if (delta < 0) {
 			if (task_current(rq, p))
-				resched_curr(rq);
+				resched_curr_lazy(rq);
 			return;
 		}
 		hrtick_start(rq, delta);
@@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct
 	 * prevents us from potentially nominating it as a false LAST_BUDDY
 	 * below.
 	 */
-	if (test_tsk_need_resched(curr))
+	if (need_resched())
 		return;
 
 	/* Idle tasks are by definition preempted by non-idle tasks. */
@@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct
 	return;
 
 preempt:
-	resched_curr(rq);
+	resched_curr_lazy(rq);
 }
 
 #ifdef CONFIG_SMP
@@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct
 	 */
 	if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 &&
 	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
-		resched_curr(rq);
+		resched_curr_lazy(rq);
 }
 
 /*
@@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct
 	 */
 	if (task_current(rq, p)) {
 		if (p->prio > oldprio)
-			resched_curr(rq);
+			resched_curr_lazy(rq);
 	} else
 		check_preempt_curr(rq, p, 0);
 }
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -108,7 +108,7 @@ static const struct dmi_system_id proces
  */
 static void __cpuidle acpi_safe_halt(void)
 {
-	if (!tif_need_resched()) {
+	if (!need_resched()) {
 		raw_safe_halt();
 		raw_local_irq_disable();
 	}
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -63,7 +63,7 @@ static __always_inline bool __must_check
 	 */
 	smp_mb__after_atomic();
 
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 
 static __always_inline bool __must_check current_clr_polling_and_test(void)
@@ -76,7 +76,7 @@ static __always_inline bool __must_check
 	 */
 	smp_mb__after_atomic();
 
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 
 #else
@@ -85,11 +85,11 @@ static inline void __current_clr_polling
 
 static inline bool __must_check current_set_polling_and_test(void)
 {
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 static inline bool __must_check current_clr_polling_and_test(void)
 {
-	return unlikely(tif_need_resched());
+	return unlikely(need_resched());
 }
 #endif
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
 	ct_cpuidle_enter();
 
 	raw_local_irq_enable();
-	while (!tif_need_resched() &&
-	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
+	while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
 		cpu_relax();
 	raw_local_irq_disable();
 
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un
 
 	if (tif_need_resched())
 		trace_flags |= TRACE_FLAG_NEED_RESCHED;
+	if (tif_need_resched_lazy())
+		trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
 	if (test_preempt_need_resched())
 		trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
 	return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -271,6 +271,7 @@ config X86
 	select HAVE_STATIC_CALL
 	select HAVE_STATIC_CALL_INLINE		if HAVE_OBJTOOL
 	select HAVE_PREEMPT_DYNAMIC_CALL
+	select HAVE_PREEMPT_AUTO
 	select HAVE_RSEQ
 	select HAVE_RUST			if X86_64
 	select HAVE_SYSCALL_TRACEPOINTS
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -81,8 +81,9 @@ struct thread_info {
 #define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
-#define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
-#define TIF_SSBD		5	/* Speculative store bypass disable */
+#define TIF_ARCH_RESCHED_LAZY	4	/* Lazy rescheduling */
+#define TIF_SINGLESTEP		5	/* reenable singlestep on user return*/
+#define TIF_SSBD		6	/* Speculative store bypass disable */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
 #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
@@ -104,6 +105,7 @@ struct thread_info {
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
+#define _TIF_ARCH_RESCHED_LAZY	(1 << TIF_ARCH_RESCHED_LAZY)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_SSBD		(1 << TIF_SSBD)
 #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc
 			return -EINTR;
 		}
 
-		if (ti_work & _TIF_NEED_RESCHED)
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
 			schedule();
 
 		if (ti_work & _TIF_NOTIFY_RESUME)
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un
 
 enum trace_flag_type {
 	TRACE_FLAG_IRQS_OFF		= 0x01,
-	TRACE_FLAG_IRQS_NOSUPPORT	= 0x02,
-	TRACE_FLAG_NEED_RESCHED		= 0x04,
+	TRACE_FLAG_NEED_RESCHED		= 0x02,
+	TRACE_FLAG_NEED_RESCHED_LAZY	= 0x04,
 	TRACE_FLAG_HARDIRQ		= 0x08,
 	TRACE_FLAG_SOFTIRQ		= 0x10,
 	TRACE_FLAG_PREEMPT_RESCHED	= 0x20,
@@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c
 
 static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
 {
-	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+	return tracing_gen_ctx_irq_test(0);
 }
 static inline unsigned int tracing_gen_ctx(void)
 {
-	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+	return tracing_gen_ctx_irq_test(0);
 }
 #endif
 
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
 		(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
 		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
 		bh_off ? 'b' :
-		(entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+		!IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
 		'.';
 
-	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
+	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
 				TRACE_FLAG_PREEMPT_RESCHED)) {
+	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+		need_resched = 'B';
+		break;
 	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
 		need_resched = 'N';
 		break;
+	case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+		need_resched = 'L';
+		break;
+	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
+		need_resched = 'b';
+		break;
 	case TRACE_FLAG_NEED_RESCHED:
 		need_resched = 'n';
 		break;
+	case TRACE_FLAG_NEED_RESCHED_LAZY:
+		need_resched = 'l';
+		break;
 	case TRACE_FLAG_PREEMPT_RESCHED:
 		need_resched = 'p';
 		break;
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -333,6 +333,23 @@ static const struct file_operations sche
 	.release	= seq_release,
 };
 
+static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
+{
+	unsigned long end = jiffies + 60 * HZ;
+
+	for (; time_before(jiffies, end) && !signal_pending(current);)
+		cpu_relax();
+
+	return cnt;
+}
+
+static const struct file_operations sched_hog_fops = {
+	.write		= sched_hog_write,
+	.open		= simple_open,
+	.llseek		= default_llseek,
+};
+
 static struct dentry *debugfs_sched;
 
 static __init int sched_init_debug(void)
@@ -374,6 +391,8 @@ static __init int sched_init_debug(void)
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
+	debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops);
+
 	return 0;
 }
 late_initcall(sched_init_debug);
Thomas Gleixner Sept. 23, 2023, 10:50 p.m. UTC | #89
On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
>> Then the question becomes whether we'd want to introduce a *new*
>> concept, which is a "if you are going to schedule, do it now rather
>> than later, because I'm taking a lock, and while it's a preemptible
>> lock, I'd rather not sleep while holding this resource".
>>
>> I suspect we want to avoid that for now, on the assumption that it's
>> hopefully not a problem in practice (the recently addressed problem
>> with might_sleep() was that it actively *moved* the scheduling point
>> to a bad place, not that scheduling could happen there, so instead of
>> optimizing scheduling, it actively pessimized it). But I thought I'd
>> mention it.
>
> I think we want to avoid that completely and if this becomes an issue,
> we rather be smart about it at the core level.
>
> It's trivial enough to have a per task counter which tells whether a
> preemtible lock is held (or about to be acquired) or not. Then the
> scheduler can take that hint into account and decide to grant a
> timeslice extension once in the expectation that the task leaves the
> lock held section soonish and either returns to user space or schedules
> out. It still can enforce it later on.
>
> We really want to let the scheduler decide and rather give it proper
> hints at the conceptual level instead of letting developers make random
> decisions which might work well for a particular use case and completely
> suck for the rest. I think we wasted enough time already on those.

Finally I realized why cond_resched() et al. are so disgusting. They are
scope-less and just mark random spots which someone decided were a good
place to reschedule.

But in fact the really relevant measure is scope. Full preemption is
scope based:

      preempt_disable();
      do_stuff();
      preempt_enable();

which also nests properly:

      preempt_disable();
      do_stuff()
        preempt_disable();
        do_other_stuff();
        preempt_enable();
      preempt_enable();

cond_resched() cannot nest and is obviously scope-less.

The TIF_ALLOW_RESCHED mechanism, which sparked this discussion, only
pretends to be scoped.

As Peter pointed out it does not properly nest with other mechanisms and
it cannot even nest in itself because it is boolean.

The worst thing about it is that it is semantically the reverse of the
established model of preempt_disable()/enable(),
i.e. allow_resched()/disallow_resched().

So instead of giving the scheduler a hint about 'this might be a good
place to preempt', providing proper scope would make way more sense:

      preempt_lazy_disable();
      do_stuff();
      preempt_lazy_enable();

That would be the obvious and semantically consistent counterpart to the
existing preemption control primitives with proper nesting support.

might_sleep(), which is in all the lock acquire functions or your
variant of hint (resched better now before I take the lock) are the
wrong place.

     hint();
     lock();
     do_stuff();
     unlock();

hint() might schedule, and when the task comes back it might schedule
immediately again because the lock is contended. hint() again does not
have scope and might be meaningless or even counterproductive if called
in a deeper callchain.

Proper scope based hints avoid that.

      preempt_lazy_disable();
      lock();
      do_stuff();
      unlock();
      preempt_lazy_enable();
      
That's way better because it describes the scope and the task will
either schedule out in lock() on contention or provide a sensible lazy
preemption point in preempt_lazy_enable(). It also nests properly:

      preempt_lazy_disable();
      lock(A);
      do_stuff()
        preempt_lazy_disable();
        lock(B);
        do_other_stuff();
        unlock(B);
        preempt_lazy_enable();
      unlock(A);
      preempt_lazy_enable();

So in this case it does not matter whether do_stuff() is invoked from a
lock held section or not. The scope which defines the throughput
relevant hint to the scheduler is correct in any case.

Contrary to preempt_disable(), the lazy variant prevents neither
scheduling nor preemption, but it is an understandable, properly
nestable mechanism.
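
For illustration only, a minimal sketch of how such scope primitives
could look -- assuming a hypothetical per-task lazy_scope counter so the
hint nests exactly like preempt_count() does, and reusing
tif_need_resched_lazy() from the PoC; this is not part of the PoC:

        static inline void preempt_lazy_disable(void)
        {
                current->lazy_scope++;  /* hypothetical per-task counter */
                barrier();
        }

        static inline void preempt_lazy_enable(void)
        {
                barrier();
                /* End of scope is a sensible lazy preemption point */
                if (!--current->lazy_scope && tif_need_resched_lazy())
                        schedule();
        }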

I seriously hope to avoid it altogether :)

Thanks,

        tglx
Thomas Gleixner Sept. 24, 2023, 12:10 a.m. UTC | #90
On Sun, Sep 24 2023 at 00:50, Thomas Gleixner wrote:
> On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
> That's way better because it describes the scope and the task will
> either schedule out in lock() on contention or provide a sensible lazy
> preemption point in preempt_lazy_enable(). It also nests properly:
>
>       preempt_lazy_disable();
>       lock(A);
>       do_stuff()
>         preempt_lazy_disable();
>         lock(B);
>         do_other_stuff();
>         unlock(B);
>         preempt_lazy_enable();
>       unlock(A);
>       preempt_lazy_enable();
>
> So in this case it does not matter whether do_stuff() is invoked from a
> lock held section or not. The scope which defines the throughput
> relevant hint to the scheduler is correct in any case.

Which also means that automatically injecting it into the lock
primitives suddenly makes sense, in the same way that the implicit
preempt_disable() in the rw/spinlock primitives does.

Thanks,

        tglx
Matthew Wilcox Sept. 24, 2023, 7:19 a.m. UTC | #91
On Sun, Sep 24, 2023 at 12:50:43AM +0200, Thomas Gleixner wrote:
> cond_resched() cannot nest and is obviously scope-less.
> 
> The TIF_ALLOW_RESCHED mechanism, which sparked this discussion only
> pretends to be scoped.
> 
> As Peter pointed out it does not properly nest with other mechanisms and
> it cannot even nest in itself because it is boolean.

We can nest a single bit without turning it into a counter -- we
do this for memalloc_nofs_save() for example.  Simply return the
current value of the bit, and pass it to _restore().

eg xfs_prepare_ioend():

        /*
         * We can allocate memory here while doing writeback on behalf of
         * memory reclaim.  To avoid memory allocation deadlocks set the
         * task-wide nofs context for the following operations.
         */
        nofs_flag = memalloc_nofs_save();

        /* Convert CoW extents to regular */
        if (!status && (ioend->io_flags & IOMAP_F_SHARED)) {
                status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
                                ioend->io_offset, ioend->io_size);
        }

        memalloc_nofs_restore(nofs_flag);

I like your other approach better, but just in case anybody starts
worrying about turning a bit into a counter, there's no need to do
that.
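
Purely illustrative, the same save/restore pattern applied to the
resched-allow bit; the helper names are made up for the example:

        static inline bool allow_resched_save(void)
        {
                /* Returns true if an outer scope already set the bit */
                return test_and_set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
        }

        static inline void allow_resched_restore(bool outer)
        {
                if (!outer)
                        clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
        }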
Thomas Gleixner Sept. 24, 2023, 7:55 a.m. UTC | #92
On Sun, Sep 24 2023 at 08:19, Matthew Wilcox wrote:
> On Sun, Sep 24, 2023 at 12:50:43AM +0200, Thomas Gleixner wrote:
>> cond_resched() cannot nest and is obviously scope-less.
>> 
>> The TIF_ALLOW_RESCHED mechanism, which sparked this discussion only
>> pretends to be scoped.
>> 
>> As Peter pointed out it does not properly nest with other mechanisms and
>> it cannot even nest in itself because it is boolean.
>
> We can nest a single bit without turning it into a counter -- we
> do this for memalloc_nofs_save() for example.  Simply return the
> current value of the bit, and pass it to _restore().

Right.

That works, but the reverse logic still does not make sense:

        allow_resched();
        ....
        spin_lock();

while
        resched_now_is_suboptimal();
        ...
        spin_lock();

works.

Thanks,

        tglx
Matthew Wilcox Sept. 24, 2023, 10:29 a.m. UTC | #93
On Sun, Sep 24, 2023 at 09:55:52AM +0200, Thomas Gleixner wrote:
> On Sun, Sep 24 2023 at 08:19, Matthew Wilcox wrote:
> > On Sun, Sep 24, 2023 at 12:50:43AM +0200, Thomas Gleixner wrote:
> >> cond_resched() cannot nest and is obviously scope-less.
> >> 
> >> The TIF_ALLOW_RESCHED mechanism, which sparked this discussion only
> >> pretends to be scoped.
> >> 
> >> As Peter pointed out it does not properly nest with other mechanisms and
> >> it cannot even nest in itself because it is boolean.
> >
> > We can nest a single bit without turning it into a counter -- we
> > do this for memalloc_nofs_save() for example.  Simply return the
> > current value of the bit, and pass it to _restore().
> 
> Right.
> 
> That works, but the reverse logic still does not make sense:
> 
>         allow_resched();
>         ....
>         spin_lock();
> 
> while
>         resched_now_is_suboptimal();
>         ...
>         spin_lock();
> 
> works.

Oh, indeed.  I had in mind

	state = resched_now_is_suboptimal();
	spin_lock();
	...
	spin_unlock();
	resched_might_be_optimal_again(state);

... or we could bundle it up ...

	state = spin_lock_resched_disable();
	...
	spin_unlock_resched_restore(state);
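
i.e. something like the sketch below -- all names hypothetical, carried
over from above, with the lock pointer added so the wrappers are
self-contained:

        static inline bool spin_lock_resched_disable(spinlock_t *lock)
        {
                bool state = resched_now_is_suboptimal();       /* hypothetical */

                spin_lock(lock);
                return state;
        }

        static inline void spin_unlock_resched_restore(spinlock_t *lock, bool state)
        {
                spin_unlock(lock);
                resched_might_be_optimal_again(state);          /* hypothetical */
        }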
Ankur Arora Sept. 25, 2023, 12:13 a.m. UTC | #94
Thomas Gleixner <tglx@linutronix.de> writes:

> On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
>> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
>>> Then the question becomes whether we'd want to introduce a *new*
>>> concept, which is a "if you are going to schedule, do it now rather
>>> than later, because I'm taking a lock, and while it's a preemptible
>>> lock, I'd rather not sleep while holding this resource".
>>>
>>> I suspect we want to avoid that for now, on the assumption that it's
>>> hopefully not a problem in practice (the recently addressed problem
>>> with might_sleep() was that it actively *moved* the scheduling point
>>> to a bad place, not that scheduling could happen there, so instead of
>>> optimizing scheduling, it actively pessimized it). But I thought I'd
>>> mention it.
>>
>> I think we want to avoid that completely and if this becomes an issue,
>> we rather be smart about it at the core level.
>>
>> It's trivial enough to have a per task counter which tells whether a
>> preemtible lock is held (or about to be acquired) or not. Then the
>> scheduler can take that hint into account and decide to grant a
>> timeslice extension once in the expectation that the task leaves the
>> lock held section soonish and either returns to user space or schedules
>> out. It still can enforce it later on.
>>
>> We really want to let the scheduler decide and rather give it proper
>> hints at the conceptual level instead of letting developers make random
>> decisions which might work well for a particular use case and completely
>> suck for the rest. I think we wasted enough time already on those.
>
> Finally I realized why cond_resched() & et al. are so disgusting. They
> are scope-less and just a random spot which someone decided to be a good
> place to reschedule.
>
> But in fact the really relevant measure is scope. Full preemption is
> scope based:
>
>       preempt_disable();
>       do_stuff();
>       preempt_enable();
>
> which also nests properly:
>
>       preempt_disable();
>       do_stuff()
>         preempt_disable();
>         do_other_stuff();
>         preempt_enable();
>       preempt_enable();
>
> cond_resched() cannot nest and is obviously scope-less.

That's true. Though I would argue that another way to look at
cond_resched() is that it summarizes two kinds of state. First, the
timer/resched activity that might cause you to schedule. Second, an
annotation from the programmer summarizing their understanding of the
execution stack -- namely that no resources are held across the current
point.

The second is, as you say, hard to get right -- because there's no clear
definition of what it means to get it right, resulting in random
placement of cond_resched() calls until latency improves.
In any case this summary of execution state is done better by just
always tracking preemption scope.

> The TIF_ALLOW_RESCHED mechanism, which sparked this discussion only
> pretends to be scoped.
>
> As Peter pointed out it does not properly nest with other mechanisms and
> it cannot even nest in itself because it is boolean.
>
> The worst thing about it is that it is semantically reverse to the
> established model of preempt_disable()/enable(),
> i.e. allow_resched()/disallow_resched().

Can't disagree with that. In part it was that way because I was trying
to provide an alternative to cond_resched() while executing in a particular
preemptible scope -- except for not actually having any real notion of scoping.

>
> So instead of giving the scheduler a hint about 'this might be a good
> place to preempt', providing proper scope would make way more sense:
>
>       preempt_lazy_disable();
>       do_stuff();
>       preempt_lazy_enable();
>
> That would be the obvious and semantically consistent counterpart to the
> existing preemption control primitives with proper nesting support.
>
> might_sleep(), which is in all the lock acquire functions or your
> variant of hint (resched better now before I take the lock) are the
> wrong place.
>
>      hint();
>      lock();
>      do_stuff();
>      unlock();
>
> hint() might schedule and when the task comes back schedule immediately
> again because the lock is contended. hint() does again not have scope
> and might be meaningless or even counterproductive if called in a deeper
> callchain.

Perhaps another problem is that some of these hints are useful for two
different things: as an annotation about the state of execution, and
also as a hint to the scheduler.

For instance, this fix that Linus pointed to a few days ago:
4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()").

is using might_sleep() in the first sense.

Thanks

--
ankur
Steven Rostedt Oct. 2, 2023, 2:15 p.m. UTC | #95
On Sat, 23 Sep 2023 03:11:05 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Though definitely I'm putting a permanent NAK in place for any attempts
> to duct tape the preempt=NONE model any further by sprinkling more
> cond*() and whatever warts around.

Well, until we have this fix in, we will still need to sprinkle those
around when they are triggering watchdog timeouts. I just had to add one
recently due to a timeout report :-(




> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>  
>  	if (tif_need_resched())
>  		trace_flags |= TRACE_FLAG_NEED_RESCHED;
> +	if (tif_need_resched_lazy())
> +		trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
>  	if (test_preempt_need_resched())
>  		trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
>  	return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |

> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>  
>  enum trace_flag_type {
>  	TRACE_FLAG_IRQS_OFF		= 0x01,
> -	TRACE_FLAG_IRQS_NOSUPPORT	= 0x02,

I never cared for that NOSUPPORT flag. It's from 2008 and only used by
archs that do not support irq tracing (aka lockdep). I'm fine with
dropping it and just updating the user space libraries (which will then
no longer see it reported as unsupported, but that's fine with me).

> -	TRACE_FLAG_NEED_RESCHED		= 0x04,
> +	TRACE_FLAG_NEED_RESCHED		= 0x02,
> +	TRACE_FLAG_NEED_RESCHED_LAZY	= 0x04,

Is LAZY only used for PREEMPT_NONE? Or do we use it for CONFIG_PREEMPT?
Because, NEED_RESCHED is known, and moving that to bit 2 will break user
space. Having LAZY replace the IRQS_NOSUPPORT will cause the least
"breakage".

-- Steve

>  	TRACE_FLAG_HARDIRQ		= 0x08,
>  	TRACE_FLAG_SOFTIRQ		= 0x10,
>  	TRACE_FLAG_PREEMPT_RESCHED	= 0x20,
> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c
>  
>  static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
>  {
> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> +	return tracing_gen_ctx_irq_test(0);
>  }
>  static inline unsigned int tracing_gen_ctx(void)
>  {
> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> +	return tracing_gen_ctx_irq_test(0);
>  }
>  #endif
>  
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
>  		(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
>  		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
>  		bh_off ? 'b' :
> -		(entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
> +		!IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
>  		'.';
>  
> -	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
> +	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
>  				TRACE_FLAG_PREEMPT_RESCHED)) {
> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> +		need_resched = 'B';
> +		break;
>  	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
>  		need_resched = 'N';
>  		break;
> +	case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> +		need_resched = 'L';
> +		break;
> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
> +		need_resched = 'b';
> +		break;
>  	case TRACE_FLAG_NEED_RESCHED:
>  		need_resched = 'n';
>  		break;
> +	case TRACE_FLAG_NEED_RESCHED_LAZY:
> +		need_resched = 'l';
> +		break;
>  	case TRACE_FLAG_PREEMPT_RESCHED:
>  		need_resched = 'p';
>  		break;
Thomas Gleixner Oct. 2, 2023, 4:13 p.m. UTC | #96
On Mon, Oct 02 2023 at 10:15, Steven Rostedt wrote:
> On Sat, 23 Sep 2023 03:11:05 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> Though definitely I'm putting a permanent NAK in place for any attempts
>> to duct tape the preempt=NONE model any further by sprinkling more
>> cond*() and whatever warts around.
>
> Well, until we have this fix in, we will still need to sprinkle those
> around when they are triggering watchdog timeouts. I just had to add one
> recently due to a timeout report :-(

cond_resched() sure. But not new flavours of it, like the
[dis]allow_resched() which sparked this discussion.

>> -	TRACE_FLAG_NEED_RESCHED		= 0x04,
>> +	TRACE_FLAG_NEED_RESCHED		= 0x02,
>> +	TRACE_FLAG_NEED_RESCHED_LAZY	= 0x04,
>
> Is LAZY only used for PREEMPT_NONE? Or do we use it for CONFIG_PREEMPT?
> Because the NEED_RESCHED bit value is known to user space, and moving it to
> bit 2 will break user space. Having LAZY replace the IRQS_NOSUPPORT bit
> will cause the least "breakage".

Either way works for me.
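
(For reference, the least-breakage variant would look like this: the new
LAZY flag takes over the retired IRQS_NOSUPPORT bit and NEED_RESCHED stays
at 0x04 where existing user space expects it; the other flags are untouched.
A sketch, not applied anywhere yet:

	enum trace_flag_type {
		TRACE_FLAG_IRQS_OFF		= 0x01,
		TRACE_FLAG_NEED_RESCHED_LAZY	= 0x02,	/* takes over old IRQS_NOSUPPORT */
		TRACE_FLAG_NEED_RESCHED		= 0x04,	/* unchanged for user space */
		TRACE_FLAG_HARDIRQ		= 0x08,
		TRACE_FLAG_SOFTIRQ		= 0x10,
		TRACE_FLAG_PREEMPT_RESCHED	= 0x20,
	};
)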

Thanks,

        tglx
Geert Uytterhoeven Oct. 6, 2023, 1:01 p.m. UTC | #97
Hi Thomas,

On Tue, Sep 19, 2023 at 9:57 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> Though it just occured to me that there are dragons lurking:
>
> arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
> arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
> arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
> arch/um/Kconfig:        select ARCH_NO_PREEMPT
>
> So we have four architectures which refuse to enable preemption points,
> i.e. the only model they allow is NONE and they rely on cond_resched()
> for breaking large computations.

Looks like there is a fifth one hidden: although openrisc does not
select ARCH_NO_PREEMPT, it neither calls preempt_schedule_irq() nor
selects GENERIC_ENTRY?

Gr{oetje,eeting}s,

                        Geert
Geert Uytterhoeven Oct. 6, 2023, 2:51 p.m. UTC | #98
Hi Willy,

On Tue, Sep 19, 2023 at 3:01 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote:
> > Though it just occured to me that there are dragons lurking:
> >
> > arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
> > arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
> > arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
> > arch/um/Kconfig:        select ARCH_NO_PREEMPT
>
> Sounds like three-and-a-half architectures which could be queued up for
> removal right behind ia64 ...
>
> I suspect none of these architecture maintainers have any idea there's a
> problem.  Look at commit 87a4c375995e and the discussion in
> https://lore.kernel.org/lkml/20180724175646.3621-1-hch@lst.de/
>
> Let's cc those maintainers so they can remove this and fix whatever
> breaks.

Looks like your scare tactics are working ;-)
[PATCH/RFC] m68k: Add full preempt support
https://lore.kernel.org/all/7858a184cda66e0991fd295c711dfed7e4d1248c.1696603287.git.geert@linux-m68k.org

Gr{oetje,eeting}s,

                        Geert
Paul E. McKenney Oct. 18, 2023, 1:03 a.m. UTC | #99
On Sat, Sep 23, 2023 at 03:11:05AM +0200, Thomas Gleixner wrote:
> On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
> > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
> >> That said - I think as a proof of concept and "look, with this we get
> >> the expected scheduling event counts", that patch is perfect. I think
> >> you more than proved the concept.
> >
> > There is certainly quite some analysis work to do to make this a one to
> > one replacement.
> >
> > With a handful of benchmarks the PoC (tweaked with some obvious fixes)
> > is pretty much on par with the current mainline variants (NONE/FULL),
> > but the memtier benchmark makes a massive dent.
> >
> > It sports a whopping 10% regression with the LAZY mode versus the mainline
> > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
> >
> > That benchmark is really sensitive to the preemption model. With current
> > mainline (PREEMPT_DYNAMIC enabled) the preempt=FULL model has ~20%
> > performance drop versus preempt=NONE.
> 
> That 20% was a tired pilot error. The real number is in the 5% ballpark.
> 
> > I have no clue what's going on there yet, but that shows that there is
> > obviously quite some work ahead to get this sorted.
> 
> It took some head scratching to figure that out. The initial fix broke
> the handling of the hog issue, i.e. the problem that Ankur tried to
> solve, but I hacked up a "solution" for that too.
> 
> With that the memtier benchmark is roughly back to the mainline numbers,
> but my throughput benchmark know-how is pretty close to zero, so that
> should be looked at by people who actually understand these things.
> 
> Likewise the hog prevention is just at the PoC level and clearly beyond
> my knowledge of scheduler details: It unconditionally forces a
> reschedule when the looping task is not responding to a lazy reschedule
> request before the next tick. IOW it forces a reschedule on the second
> tick, which is obviously different from the cond_resched()/might_sleep()
> behaviour.
> 
> The changes vs. the original PoC aside of the bug and thinko fixes:
> 
>     1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
>        lazy preempt bit as the trace_entry::flags field is full already.
> 
>        That obviously breaks the tracer ABI, but if we go there then
>        this needs to be fixed. Steven?
> 
>     2) debugfs file to validate that loops can be force preempted w/o
>        cond_resched()
> 
>        The usage is:
> 
>        # taskset -c 1 bash
>        # echo 1 > /sys/kernel/debug/sched/hog &
>        # echo 1 > /sys/kernel/debug/sched/hog &
>        # echo 1 > /sys/kernel/debug/sched/hog &
> 
>        top shows ~33% CPU for each of the hogs and tracing confirms that
>        the crude hack in the scheduler tick works:
> 
>             bash-4559    [001] dlh2.  2253.331202: resched_curr <-__update_curr
>             bash-4560    [001] dlh2.  2253.340199: resched_curr <-__update_curr
>             bash-4561    [001] dlh2.  2253.346199: resched_curr <-__update_curr
>             bash-4559    [001] dlh2.  2253.353199: resched_curr <-__update_curr
>             bash-4561    [001] dlh2.  2253.358199: resched_curr <-__update_curr
>             bash-4560    [001] dlh2.  2253.370202: resched_curr <-__update_curr
>             bash-4559    [001] dlh2.  2253.378198: resched_curr <-__update_curr
>             bash-4561    [001] dlh2.  2253.389199: resched_curr <-__update_curr
> 
>        The 'l' instead of the usual 'N' reflects that the lazy resched
>        bit is set. That makes __update_curr() invoke resched_curr()
>        instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
>        and folds it into preempt_count so that preemption happens at the
>        next possible point, i.e. either in return from interrupt or at
>        the next preempt_enable().

Belatedly calling out some RCU issues.  Nothing fatal, just a
(surprisingly) few adjustments that will need to be made.  The key thing
to note is that from RCU's viewpoint, with this change, all kernels
are preemptible, though rcu_read_lock() readers remain non-preemptible.
With that:

1.	As an optimization, given that preempt_count() would always give
	good information, the scheduling-clock interrupt could sense RCU
	readers for new-age CONFIG_PREEMPT_NONE=y kernels.  As might the
	IPI handlers for expedited grace periods.  A nice optimization.
	Except that...

2.	The quiescent-state-forcing code currently relies on the presence
	of cond_resched() in CONFIG_PREEMPT_RCU=n kernels.  One fix
	would be to do resched_cpu() more quickly, but some workloads
	might not love the additional IPIs.  Another approach is to do #1
	above, replacing the quiescent states from cond_resched() with
	scheduler-tick-interrupt-sensed quiescent states.

	Plus...

3.	For nohz_full CPUs that run for a long time in the kernel,
	there are no scheduling-clock interrupts.  RCU reaches for
	the resched_cpu() hammer a few jiffies into the grace period.
	And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
	interrupt-entry code will re-enable its scheduling-clock interrupt
	upon receiving the resched_cpu() IPI.

	So nohz_full CPUs should be OK as far as RCU is concerned.
	Other subsystems might have other opinions.

4.	As another optimization, kvfree_rcu() could unconditionally
	check preempt_count() to sense a clean environment suitable for
	memory allocation.

5.	Kconfig files with "select TASKS_RCU if PREEMPTION" must
	instead say "select TASKS_RCU".  This means that the #else
	in include/linux/rcupdate.h that defines TASKS_RCU in terms of
	vanilla RCU must go.  There might be some fallout if something
	fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
	and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
	rcu_tasks_classic_qs() to do something useful.

6.	You might think that RCU Tasks (as opposed to RCU Tasks Trace
	or RCU Tasks Rude) would need those pesky cond_resched() calls
	to stick around.  The reason is that RCU Tasks readers are ended
	only by voluntary context switches.  This means that although a
	preemptible infinite loop in the kernel won't inconvenience a
	real-time task (nor a non-real-time task for all that long),
	and won't delay grace periods for the other flavors of RCU,
	it would indefinitely delay an RCU Tasks grace period.

	However, RCU Tasks grace periods seem to be finite in preemptible
	kernels today, so they should remain finite in limited-preemptible
	kernels tomorrow.  Famous last words...

7.	RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
	any algorithmic difference from this change.

8.	As has been noted elsewhere, in this new limited-preemption
	mode of operation, rcu_read_lock() readers remain preemptible.
	This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.

9.	The rcu_preempt_depth() macro could do something useful in
	limited-preemption kernels.  Its current lack of ability in
	CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.

10.	The cond_resched_rcu() function must remain because we still
	have non-preemptible rcu_read_lock() readers.

11.	My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
	unchanged, but I must defer to the include/net/ip_vs.h people.

12.	I need to check with the BPF folks on the BPF verifier's
	definition of BTF_ID(func, rcu_read_unlock_strict).

13.	The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
	function might have some redundancy across the board instead
	of just on CONFIG_PREEMPT_RCU=y.  Or might not.

14.	The kernel/trace/trace_osnoise.c file's run_osnoise() function
	might need to do something for non-preemptible RCU to make
	up for the lack of cond_resched() calls.  Maybe just drop the
	"IS_ENABLED()" and execute the body of the current "if" statement
	unconditionally.

15.	I must defer to others on the mm/pgtable-generic.c file's
	#ifdef that depends on CONFIG_PREEMPT_RCU.
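
To put points 1 and 2 in code form, here is a minimal sketch of a
tick-sensed quiescent state, assuming preempt_count() is now always
accurate.  The check mirrors what the CONFIG_PREEMPT_RCU=n
scheduling-clock path can already do when PREEMPT_COUNT=y; the difference
is that it could become unconditional and stand in for the quiescent
states that cond_resched() used to supply.  The wrapper name is made up;
rcu_qs() and rcu_is_cpu_rrupt_from_idle() are existing RCU internals:

	static void rcu_tick_sense_qs(int user)
	{
		/*
		 * Quiescent state if the tick interrupted user mode, the
		 * idle loop, or kernel code that was neither
		 * preempt-disabled nor in softirq.  With PREEMPT_RCU=n,
		 * rcu_read_lock() maps to preempt_disable(), so this also
		 * excludes RCU readers.
		 */
		if (user || rcu_is_cpu_rrupt_from_idle() ||
		    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))
			rcu_qs();
	}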

While in the area, I noted that KLP seems to depend on cond_resched(),
but on this I must defer to the KLP people.

I am sure that I am missing something, but I have not yet seen any
show-stoppers.  Just some needed adjustments.

Thoughts?

							Thanx, Paul

> That's as much as I wanted to demonstrate and I'm not going to spend
> more cycles on it as I have already too many other things in flight and
> the resulting scheduler woes are clearly outside of my expertise.
> 
> Though definitely I'm putting a permanent NAK in place for any attempts
> to duct tape the preempt=NONE model any further by sprinkling more
> cond*() and whatever warts around.
> 
> Thanks,
> 
>         tglx
> ---
>  arch/x86/Kconfig                   |    1 
>  arch/x86/include/asm/thread_info.h |    6 ++--
>  drivers/acpi/processor_idle.c      |    2 -
>  include/linux/entry-common.h       |    2 -
>  include/linux/entry-kvm.h          |    2 -
>  include/linux/sched.h              |   12 +++++---
>  include/linux/sched/idle.h         |    8 ++---
>  include/linux/thread_info.h        |   24 +++++++++++++++++
>  include/linux/trace_events.h       |    8 ++---
>  kernel/Kconfig.preempt             |   17 +++++++++++-
>  kernel/entry/common.c              |    4 +-
>  kernel/entry/kvm.c                 |    2 -
>  kernel/sched/core.c                |   51 +++++++++++++++++++++++++------------
>  kernel/sched/debug.c               |   19 +++++++++++++
>  kernel/sched/fair.c                |   46 ++++++++++++++++++++++-----------
>  kernel/sched/features.h            |    2 +
>  kernel/sched/idle.c                |    3 --
>  kernel/sched/sched.h               |    1 
>  kernel/trace/trace.c               |    2 +
>  kernel/trace/trace_output.c        |   16 ++++++++++-
>  20 files changed, 171 insertions(+), 57 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct
>  
>  #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
>  /*
> - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
> + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
>   * this avoids any races wrt polling state changes and thereby avoids
>   * spurious IPIs.
>   */
> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
>  {
>  	struct thread_info *ti = task_thread_info(p);
> -	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
> +
> +	return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG);
>  }
>  
>  /*
> @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas
>  	for (;;) {
>  		if (!(val & _TIF_POLLING_NRFLAG))
>  			return false;
> -		if (val & _TIF_NEED_RESCHED)
> +		if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>  			return true;
>  		if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
>  			break;
> @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas
>  }
>  
>  #else
> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
>  {
> -	set_tsk_need_resched(p);
> +	set_tsk_thread_flag(p, tif_bit);
>  	return true;
>  }
>  
> @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head)
>   * might also involve a cross-CPU call to trigger the scheduler on
>   * the target CPU.
>   */
> -void resched_curr(struct rq *rq)
> +static void __resched_curr(struct rq *rq, int lazy)
>  {
> +	int cpu, tif_bit = TIF_NEED_RESCHED + lazy;
>  	struct task_struct *curr = rq->curr;
> -	int cpu;
>  
>  	lockdep_assert_rq_held(rq);
>  
> -	if (test_tsk_need_resched(curr))
> +	if (unlikely(test_tsk_thread_flag(curr, tif_bit)))
>  		return;
>  
>  	cpu = cpu_of(rq);
>  
>  	if (cpu == smp_processor_id()) {
> -		set_tsk_need_resched(curr);
> -		set_preempt_need_resched();
> +		set_tsk_thread_flag(curr, tif_bit);
> +		if (!lazy)
> +			set_preempt_need_resched();
>  		return;
>  	}
>  
> -	if (set_nr_and_not_polling(curr))
> -		smp_send_reschedule(cpu);
> -	else
> +	if (set_nr_and_not_polling(curr, tif_bit)) {
> +		if (!lazy)
> +			smp_send_reschedule(cpu);
> +	} else {
>  		trace_sched_wake_idle_without_ipi(cpu);
> +	}
> +}
> +
> +void resched_curr(struct rq *rq)
> +{
> +	__resched_curr(rq, 0);
> +}
> +
> +void resched_curr_lazy(struct rq *rq)
> +{
> +	int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ?
> +		TIF_NEED_RESCHED_LAZY_OFFSET : 0;
> +
> +	if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED)))
> +		return;
> +
> +	__resched_curr(rq, lazy);
>  }
>  
>  void resched_cpu(int cpu)
> @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu)
>  	if (cpu == smp_processor_id())
>  		return;
>  
> -	if (set_nr_and_not_polling(rq->idle))
> +	if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
>  		smp_send_reschedule(cpu);
>  	else
>  		trace_sched_wake_idle_without_ipi(cpu);
> @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init(
>  		WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
>  		return preempt_dynamic_mode == preempt_dynamic_##mode;		 \
>  	}									 \
> -	EXPORT_SYMBOL_GPL(preempt_model_##mode)
>  
>  PREEMPT_MODEL_ACCESSOR(none);
>  PREEMPT_MODEL_ACCESSOR(voluntary);
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -59,6 +59,16 @@ enum syscall_work_bit {
>  
>  #include <asm/thread_info.h>
>  
> +#ifdef CONFIG_PREEMPT_AUTO
> +# define TIF_NEED_RESCHED_LAZY		TIF_ARCH_RESCHED_LAZY
> +# define _TIF_NEED_RESCHED_LAZY		_TIF_ARCH_RESCHED_LAZY
> +# define TIF_NEED_RESCHED_LAZY_OFFSET	(TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
> +#else
> +# define TIF_NEED_RESCHED_LAZY		TIF_NEED_RESCHED
> +# define _TIF_NEED_RESCHED_LAZY		_TIF_NEED_RESCHED
> +# define TIF_NEED_RESCHED_LAZY_OFFSET	0
> +#endif
> +
>  #ifdef __KERNEL__
>  
>  #ifndef arch_set_restart_data
> @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res
>  			     (unsigned long *)(&current_thread_info()->flags));
>  }
>  
> +static __always_inline bool tif_need_resched_lazy(void)
> +{
> +	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
> +		arch_test_bit(TIF_NEED_RESCHED_LAZY,
> +			      (unsigned long *)(&current_thread_info()->flags));
> +}
> +
>  #else
>  
>  static __always_inline bool tif_need_resched(void)
> @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res
>  			(unsigned long *)(&current_thread_info()->flags));
>  }
>  
> +static __always_inline bool tif_need_resched_lazy(void)
> +{
> +	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
> +		test_bit(TIF_NEED_RESCHED_LAZY,
> +			 (unsigned long *)(&current_thread_info()->flags));
> +}
> +
>  #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
>  
>  #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -11,6 +11,13 @@ config PREEMPT_BUILD
>  	select PREEMPTION
>  	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
>  
> +config PREEMPT_BUILD_AUTO
> +	bool
> +	select PREEMPT_BUILD
> +
> +config HAVE_PREEMPT_AUTO
> +	bool
> +
>  choice
>  	prompt "Preemption Model"
>  	default PREEMPT_NONE
> @@ -67,9 +74,17 @@ config PREEMPT
>  	  embedded system with latency requirements in the milliseconds
>  	  range.
>  
> +config PREEMPT_AUTO
> +	bool "Automagic preemption mode with runtime tweaking support"
> +	depends on HAVE_PREEMPT_AUTO
> +	select PREEMPT_BUILD_AUTO
> +	help
> +	  Add some sensible blurb here
> +
>  config PREEMPT_RT
>  	bool "Fully Preemptible Kernel (Real-Time)"
>  	depends on EXPERT && ARCH_SUPPORTS_RT
> +	select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
>  	select PREEMPTION
>  	help
>  	  This option turns the kernel into a real-time kernel by replacing
> @@ -95,7 +110,7 @@ config PREEMPTION
>  
>  config PREEMPT_DYNAMIC
>  	bool "Preemption behaviour defined on boot"
> -	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
> +	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
>  	select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
>  	select PREEMPT_BUILD
>  	default y if HAVE_PREEMPT_DYNAMIC_CALL
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -60,7 +60,7 @@
>  #define EXIT_TO_USER_MODE_WORK						\
>  	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
>  	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
> -	 ARCH_EXIT_TO_USER_MODE_WORK)
> +	 _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
>  
>  /**
>   * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
> --- a/include/linux/entry-kvm.h
> +++ b/include/linux/entry-kvm.h
> @@ -18,7 +18,7 @@
>  
>  #define XFER_TO_GUEST_MODE_WORK						\
>  	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
> -	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> +	 _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
>  
>  struct kvm_vcpu;
>  
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l
>  
>  		local_irq_enable_exit_to_user(ti_work);
>  
> -		if (ti_work & _TIF_NEED_RESCHED)
> +		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>  			schedule();
>  
>  		if (ti_work & _TIF_UPROBE)
> @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void
>  		rcu_irq_exit_check_preempt();
>  		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
>  			WARN_ON_ONCE(!on_thread_stack());
> -		if (need_resched())
> +		if (test_tsk_need_resched(current))
>  			preempt_schedule_irq();
>  	}
>  }
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
>  SCHED_FEAT(LATENCY_WARN, false)
>  
>  SCHED_FEAT(HZ_BW, true)
> +
> +SCHED_FEAT(FORCE_NEED_RESCHED, false)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
>  extern void reweight_task(struct task_struct *p, int prio);
>  
>  extern void resched_curr(struct rq *rq);
> +extern void resched_curr_lazy(struct rq *rq);
>  extern void resched_cpu(int cpu);
>  
>  extern struct rt_bandwidth def_rt_bandwidth;
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
>  	update_ti_thread_flag(task_thread_info(tsk), flag, value);
>  }
>  
> -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
> +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
>  {
>  	return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
>  }
>  
> -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
> +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
>  {
>  	return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
>  }
>  
> -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
> +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
>  {
>  	return test_ti_thread_flag(task_thread_info(tsk), flag);
>  }
> @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched(
>  static inline void clear_tsk_need_resched(struct task_struct *tsk)
>  {
>  	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
> +	if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
> +		clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
>  }
>  
> -static inline int test_tsk_need_resched(struct task_struct *tsk)
> +static inline bool test_tsk_need_resched(struct task_struct *tsk)
>  {
>  	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
>  }
> @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc
>  
>  static __always_inline bool need_resched(void)
>  {
> -	return unlikely(tif_need_resched());
> +	return unlikely(tif_need_resched_lazy() || tif_need_resched());
>  }
>  
>  /*
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq
>   * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
>   * this is probably good enough.
>   */
> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick)
>  {
> +	struct rq *rq = rq_of(cfs_rq);
> +
>  	if ((s64)(se->vruntime - se->deadline) < 0)
>  		return;
>  
> @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r
>  	/*
>  	 * The task has consumed its request, reschedule.
>  	 */
> -	if (cfs_rq->nr_running > 1) {
> -		resched_curr(rq_of(cfs_rq));
> -		clear_buddies(cfs_rq, se);
> +	if (cfs_rq->nr_running < 2)
> +		return;
> +
> +	if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
> +		resched_curr(rq);
> +	} else {
> +		/* Did the task ignore the lazy reschedule request? */
> +		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
> +			resched_curr(rq);
> +		else
> +			resched_curr_lazy(rq);
>  	}
> +	clear_buddies(cfs_rq, se);
>  }
>  
>  #include "pelt.h"
> @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf
>  /*
>   * Update the current task's runtime statistics.
>   */
> -static void update_curr(struct cfs_rq *cfs_rq)
> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
>  {
>  	struct sched_entity *curr = cfs_rq->curr;
>  	u64 now = rq_clock_task(rq_of(cfs_rq));
> @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c
>  	schedstat_add(cfs_rq->exec_clock, delta_exec);
>  
>  	curr->vruntime += calc_delta_fair(delta_exec, curr);
> -	update_deadline(cfs_rq, curr);
> +	update_deadline(cfs_rq, curr, tick);
>  	update_min_vruntime(cfs_rq);
>  
>  	if (entity_is_task(curr)) {
> @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c
>  	account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }
>  
> +static inline void update_curr(struct cfs_rq *cfs_rq)
> +{
> +	__update_curr(cfs_rq, false);
> +}
> +
>  static void update_curr_fair(struct rq *rq)
>  {
>  	update_curr(cfs_rq_of(&rq->curr->se));
> @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>  	/*
>  	 * Update run-time statistics of the 'current'.
>  	 */
> -	update_curr(cfs_rq);
> +	__update_curr(cfs_rq, true);
>  
>  	/*
>  	 * Ensure that runnable average is periodically updated.
> @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>  	 * validating it and just reschedule.
>  	 */
>  	if (queued) {
> -		resched_curr(rq_of(cfs_rq));
> +		resched_curr_lazy(rq_of(cfs_rq));
>  		return;
>  	}
>  	/*
> @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str
>  	 * hierarchy can be throttled
>  	 */
>  	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
> -		resched_curr(rq_of(cfs_rq));
> +		resched_curr_lazy(rq_of(cfs_rq));
>  }
>  
>  static __always_inline
> @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
>  
>  	/* Determine whether we need to wake up potentially idle CPU: */
>  	if (rq->curr == rq->idle && rq->cfs.nr_running)
> -		resched_curr(rq);
> +		resched_curr_lazy(rq);
>  }
>  
>  #ifdef CONFIG_SMP
> @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq
>  
>  		if (delta < 0) {
>  			if (task_current(rq, p))
> -				resched_curr(rq);
> +				resched_curr_lazy(rq);
>  			return;
>  		}
>  		hrtick_start(rq, delta);
> @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct
>  	 * prevents us from potentially nominating it as a false LAST_BUDDY
>  	 * below.
>  	 */
> -	if (test_tsk_need_resched(curr))
> +	if (need_resched())
>  		return;
>  
>  	/* Idle tasks are by definition preempted by non-idle tasks. */
> @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct
>  	return;
>  
>  preempt:
> -	resched_curr(rq);
> +	resched_curr_lazy(rq);
>  }
>  
>  #ifdef CONFIG_SMP
> @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct
>  	 */
>  	if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 &&
>  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
> -		resched_curr(rq);
> +		resched_curr_lazy(rq);
>  }
>  
>  /*
> @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct
>  	 */
>  	if (task_current(rq, p)) {
>  		if (p->prio > oldprio)
> -			resched_curr(rq);
> +			resched_curr_lazy(rq);
>  	} else
>  		check_preempt_curr(rq, p, 0);
>  }
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -108,7 +108,7 @@ static const struct dmi_system_id proces
>   */
>  static void __cpuidle acpi_safe_halt(void)
>  {
> -	if (!tif_need_resched()) {
> +	if (!need_resched()) {
>  		raw_safe_halt();
>  		raw_local_irq_disable();
>  	}
> --- a/include/linux/sched/idle.h
> +++ b/include/linux/sched/idle.h
> @@ -63,7 +63,7 @@ static __always_inline bool __must_check
>  	 */
>  	smp_mb__after_atomic();
>  
> -	return unlikely(tif_need_resched());
> +	return unlikely(need_resched());
>  }
>  
>  static __always_inline bool __must_check current_clr_polling_and_test(void)
> @@ -76,7 +76,7 @@ static __always_inline bool __must_check
>  	 */
>  	smp_mb__after_atomic();
>  
> -	return unlikely(tif_need_resched());
> +	return unlikely(need_resched());
>  }
>  
>  #else
> @@ -85,11 +85,11 @@ static inline void __current_clr_polling
>  
>  static inline bool __must_check current_set_polling_and_test(void)
>  {
> -	return unlikely(tif_need_resched());
> +	return unlikely(need_resched());
>  }
>  static inline bool __must_check current_clr_polling_and_test(void)
>  {
> -	return unlikely(tif_need_resched());
> +	return unlikely(need_resched());
>  }
>  #endif
>  
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
>  	ct_cpuidle_enter();
>  
>  	raw_local_irq_enable();
> -	while (!tif_need_resched() &&
> -	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
> +	while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
>  		cpu_relax();
>  	raw_local_irq_disable();
>  
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>  
>  	if (tif_need_resched())
>  		trace_flags |= TRACE_FLAG_NEED_RESCHED;
> +	if (tif_need_resched_lazy())
> +		trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
>  	if (test_preempt_need_resched())
>  		trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
>  	return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -271,6 +271,7 @@ config X86
>  	select HAVE_STATIC_CALL
>  	select HAVE_STATIC_CALL_INLINE		if HAVE_OBJTOOL
>  	select HAVE_PREEMPT_DYNAMIC_CALL
> +	select HAVE_PREEMPT_AUTO
>  	select HAVE_RSEQ
>  	select HAVE_RUST			if X86_64
>  	select HAVE_SYSCALL_TRACEPOINTS
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -81,8 +81,9 @@ struct thread_info {
>  #define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
>  #define TIF_SIGPENDING		2	/* signal pending */
>  #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
> -#define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
> -#define TIF_SSBD		5	/* Speculative store bypass disable */
> +#define TIF_ARCH_RESCHED_LAZY	4	/* Lazy rescheduling */
> +#define TIF_SINGLESTEP		5	/* reenable singlestep on user return*/
> +#define TIF_SSBD		6	/* Speculative store bypass disable */
>  #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
>  #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
>  #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
> @@ -104,6 +105,7 @@ struct thread_info {
>  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
>  #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
>  #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
> +#define _TIF_ARCH_RESCHED_LAZY	(1 << TIF_ARCH_RESCHED_LAZY)
>  #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
>  #define _TIF_SSBD		(1 << TIF_SSBD)
>  #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc
>  			return -EINTR;
>  		}
>  
> -		if (ti_work & _TIF_NEED_RESCHED)
> +		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>  			schedule();
>  
>  		if (ti_work & _TIF_NOTIFY_RESUME)
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>  
>  enum trace_flag_type {
>  	TRACE_FLAG_IRQS_OFF		= 0x01,
> -	TRACE_FLAG_IRQS_NOSUPPORT	= 0x02,
> -	TRACE_FLAG_NEED_RESCHED		= 0x04,
> +	TRACE_FLAG_NEED_RESCHED		= 0x02,
> +	TRACE_FLAG_NEED_RESCHED_LAZY	= 0x04,
>  	TRACE_FLAG_HARDIRQ		= 0x08,
>  	TRACE_FLAG_SOFTIRQ		= 0x10,
>  	TRACE_FLAG_PREEMPT_RESCHED	= 0x20,
> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c
>  
>  static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
>  {
> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> +	return tracing_gen_ctx_irq_test(0);
>  }
>  static inline unsigned int tracing_gen_ctx(void)
>  {
> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> +	return tracing_gen_ctx_irq_test(0);
>  }
>  #endif
>  
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
>  		(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
>  		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
>  		bh_off ? 'b' :
> -		(entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
> +		!IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
>  		'.';
>  
> -	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
> +	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
>  				TRACE_FLAG_PREEMPT_RESCHED)) {
> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> +		need_resched = 'B';
> +		break;
>  	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
>  		need_resched = 'N';
>  		break;
> +	case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> +		need_resched = 'L';
> +		break;
> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
> +		need_resched = 'b';
> +		break;
>  	case TRACE_FLAG_NEED_RESCHED:
>  		need_resched = 'n';
>  		break;
> +	case TRACE_FLAG_NEED_RESCHED_LAZY:
> +		need_resched = 'l';
> +		break;
>  	case TRACE_FLAG_PREEMPT_RESCHED:
>  		need_resched = 'p';
>  		break;
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -333,6 +333,23 @@ static const struct file_operations sche
>  	.release	= seq_release,
>  };
>  
> +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf,
> +			       size_t cnt, loff_t *ppos)
> +{
> +	unsigned long end = jiffies + 60 * HZ;
> +
> +	for (; time_before(jiffies, end) && !signal_pending(current);)
> +		cpu_relax();
> +
> +	return cnt;
> +}
> +
> +static const struct file_operations sched_hog_fops = {
> +	.write		= sched_hog_write,
> +	.open		= simple_open,
> +	.llseek		= default_llseek,
> +};
> +
>  static struct dentry *debugfs_sched;
>  
>  static __init int sched_init_debug(void)
> @@ -374,6 +391,8 @@ static __init int sched_init_debug(void)
>  
>  	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>  
> +	debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops);
> +
>  	return 0;
>  }
>  late_initcall(sched_init_debug);
>
Ankur Arora Oct. 18, 2023, 12:09 p.m. UTC | #100
Paul E. McKenney <paulmck@kernel.org> writes:

> On Sat, Sep 23, 2023 at 03:11:05AM +0200, Thomas Gleixner wrote:
>> On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
>> > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
>> >> That said - I think as a proof of concept and "look, with this we get
>> >> the expected scheduling event counts", that patch is perfect. I think
>> >> you more than proved the concept.
>> >
>> > There is certainly quite some analysis work to do to make this a one to
>> > one replacement.
>> >
>> > With a handful of benchmarks the PoC (tweaked with some obvious fixes)
>> > is pretty much on par with the current mainline variants (NONE/FULL),
>> > but the memtier benchmark makes a massive dent.
>> >
>> > It sports a whopping 10% regression with the LAZY mode versus the mainline
>> > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
>> >
>> > That benchmark is really sensitive to the preemption model. With current
>> > mainline (PREEMPT_DYNAMIC enabled) the preempt=FULL model has ~20%
>> > performance drop versus preempt=NONE.
>>
>> That 20% was a tired pilot error. The real number is in the 5% ballpark.
>>
>> > I have no clue what's going on there yet, but that shows that there is
>> > obviously quite some work ahead to get this sorted.
>>
>> It took some head scratching to figure that out. The initial fix broke
>> the handling of the hog issue, i.e. the problem that Ankur tried to
>> solve, but I hacked up a "solution" for that too.
>>
>> With that the memtier benchmark is roughly back to the mainline numbers,
>> but my throughput benchmark know-how is pretty close to zero, so that
>> should be looked at by people who actually understand these things.
>>
>> Likewise the hog prevention is just at the PoC level and clearly beyond
>> my knowledge of scheduler details: It unconditionally forces a
>> reschedule when the looping task is not responding to a lazy reschedule
>> request before the next tick. IOW it forces a reschedule on the second
>> tick, which is obviously different from the cond_resched()/might_sleep()
>> behaviour.
>>
>> The changes vs. the original PoC aside of the bug and thinko fixes:
>>
>>     1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
>>        lazy preempt bit as the trace_entry::flags field is full already.
>>
>>        That obviously breaks the tracer ABI, but if we go there then
>>        this needs to be fixed. Steven?
>>
>>     2) debugfs file to validate that loops can be force preempted w/o
>>        cond_resched()
>>
>>        The usage is:
>>
>>        # taskset -c 1 bash
>>        # echo 1 > /sys/kernel/debug/sched/hog &
>>        # echo 1 > /sys/kernel/debug/sched/hog &
>>        # echo 1 > /sys/kernel/debug/sched/hog &
>>
>>        top shows ~33% CPU for each of the hogs and tracing confirms that
>>        the crude hack in the scheduler tick works:
>>
>>             bash-4559    [001] dlh2.  2253.331202: resched_curr <-__update_curr
>>             bash-4560    [001] dlh2.  2253.340199: resched_curr <-__update_curr
>>             bash-4561    [001] dlh2.  2253.346199: resched_curr <-__update_curr
>>             bash-4559    [001] dlh2.  2253.353199: resched_curr <-__update_curr
>>             bash-4561    [001] dlh2.  2253.358199: resched_curr <-__update_curr
>>             bash-4560    [001] dlh2.  2253.370202: resched_curr <-__update_curr
>>             bash-4559    [001] dlh2.  2253.378198: resched_curr <-__update_curr
>>             bash-4561    [001] dlh2.  2253.389199: resched_curr <-__update_curr
>>
>>        The 'l' instead of the usual 'N' reflects that the lazy resched
>>        bit is set. That makes __update_curr() invoke resched_curr()
>>        instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
>>        and folds it into preempt_count so that preemption happens at the
>>        next possible point, i.e. either in return from interrupt or at
>>        the next preempt_enable().
>
> Belatedly calling out some RCU issues.  Nothing fatal, just a
> (surprisingly) few adjustments that will need to be made.  The key thing
> to note is that from RCU's viewpoint, with this change, all kernels
> are preemptible, though rcu_read_lock() readers remain non-preemptible.

Yeah, in Thomas' patch CONFIG_PREEMPTION=y and the preemption models
none/voluntary/full are just scheduler tweaks on top of that, so this
would always have PREEMPT_RCU=y. Shouldn't rcu_read_lock() readers be
preemptible then?

(An alternate configuration might be:
   config PREEMPT_NONE
      select PREEMPT_COUNT

    config PREEMPT_FULL
      select PREEMPTION

 This probably allows for more configuration flexibility across archs?
 Would allow for TREE_RCU=y, for instance. That said, so far I've only
 been working with PREEMPT_RCU=y.)

> With that:
>
> 1.	As an optimization, given that preempt_count() would always give
> 	good information, the scheduling-clock interrupt could sense RCU
> 	readers for new-age CONFIG_PREEMPT_NONE=y kernels.  As might the
> 	IPI handlers for expedited grace periods.  A nice optimization.
> 	Except that...
>
> 2.	The quiescent-state-forcing code currently relies on the presence
> 	of cond_resched() in CONFIG_PREEMPT_RCU=n kernels.  One fix
> 	would be to do resched_cpu() more quickly, but some workloads
> 	might not love the additional IPIs.  Another approach is to do #1
> 	above, replacing the quiescent states from cond_resched() with
> 	scheduler-tick-interrupt-sensed quiescent states.

Right, the call to rcu_all_qs(). Just to see if I have it straight,
something like this for PREEMPT_RCU=n kernels?

   if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0)
           rcu_all_qs();

(Masked because PREEMPT_NONE might not do any folding for
NEED_RESCHED_LAZY in the tick.)

Though the comment around rcu_all_qs() mentions that rcu_all_qs()
reports a quiescent state only if urgently needed. Given that the tick
executes less frequently than calls to cond_resched(), could we just
always report instead? Or am I completely on the wrong track?

   if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0) {
           preempt_disable();
           rcu_qs();
           preempt_enable();
   }

On your point about the preempt_count() being dependable, there's a
wrinkle. As Linus mentions in
https://lore.kernel.org/lkml/CAHk-=wgUimqtF7PqFfRw4Ju5H1KYkp6+8F=hBz7amGQ8GaGKkA@mail.gmail.com/,
that might not be true for architectures that define ARCH_NO_PREEMPT.

My plan was to limit those archs to preempting only at the user space
boundary, but there are almost certainly RCU implications that I missed.
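
Roughly, and purely as an illustration of that plan (not code from the
series): such an arch would still honor the need-resched bits when
returning to user space, while the return-to-kernel preemption path simply
wouldn't exist for it:

	/* exit to user space: all architectures */
	if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
		schedule();

	/*
	 * return from interrupt to kernel: absent (or never enabled) on
	 * ARCH_NO_PREEMPT, so in-kernel code is never preempted there.
	 */
	if (test_tsk_need_resched(current))
		preempt_schedule_irq();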

> 	Plus...
>
> 3.	For nohz_full CPUs that run for a long time in the kernel,
> 	there are no scheduling-clock interrupts.  RCU reaches for
> 	the resched_cpu() hammer a few jiffies into the grace period.
> 	And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> 	interrupt-entry code will re-enable its scheduling-clock interrupt
> 	upon receiving the resched_cpu() IPI.
>
> 	So nohz_full CPUs should be OK as far as RCU is concerned.
> 	Other subsystems might have other opinions.

Ah, that's what I thought from my reading of the RCU comments. Good to
have that confirmed. Thanks.

> 4.	As another optimization, kvfree_rcu() could unconditionally
> 	check preempt_count() to sense a clean environment suitable for
> 	memory allocation.

Had missed this completely. Could you elaborate?
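
(Guessing at what that might look like, purely as a strawman and not
necessarily what you have in mind: with preempt_count() always accurate,
the single-argument kvfree_rcu() path could check whether the calling
context allows a sleeping allocation before falling back to its no-memory
slow path.  The helper name is made up:

	static bool kvfree_rcu_sleeping_alloc_ok(void)
	{
		/* preempt_count() == 0 and IRQs enabled: GFP_KERNEL is fine */
		return preemptible();
	}
)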

> 5.	Kconfig files with "select TASKS_RCU if PREEMPTION" must
> 	instead say "select TASKS_RCU".  This means that the #else
> 	in include/linux/rcupdate.h that defines TASKS_RCU in terms of
> 	vanilla RCU must go.  There might be some fallout if something
> 	fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
> 	and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
> 	rcu_tasks_classic_qs() to do something useful.

Ack.

> 6.	You might think that RCU Tasks (as opposed to RCU Tasks Trace
> 	or RCU Tasks Rude) would need those pesky cond_resched() calls
> 	to stick around.  The reason is that RCU Tasks readers are ended
> 	only by voluntary context switches.  This means that although a
> 	preemptible infinite loop in the kernel won't inconvenience a
> 	real-time task (nor a non-real-time task for all that long),
> 	and won't delay grace periods for the other flavors of RCU,
> 	it would indefinitely delay an RCU Tasks grace period.
>
> 	However, RCU Tasks grace periods seem to be finite in preemptible
> 	kernels today, so they should remain finite in limited-preemptible
> 	kernels tomorrow.  Famous last words...
>
> 7.	RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
> 	any algorithmic difference from this change.

So, essentially, as long as tasks eventually, in the fullness of
time, voluntarily call schedule(), removing cond_resched() shouldn't have
any effect on RCU Tasks :).

> 8.	As has been noted elsewhere, in this new limited-preemption
> 	mode of operation, rcu_read_lock() readers remain preemptible.
> 	This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.

Ack.

> 9.	The rcu_preempt_depth() macro could do something useful in
> 	limited-preemption kernels.  Its current lack of ability in
> 	CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.
>
> 10.	The cond_resched_rcu() function must remain because we still
> 	have non-preemptible rcu_read_lock() readers.

For configurations with PREEMPT_RCU=n? Yes, agreed. Though it need
only be this, right?:

   static inline void cond_resched_rcu(void)
   {
   #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
           rcu_read_unlock();

           rcu_read_lock();
   #endif
   }

> 11.	My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
> 	unchanged, but I must defer to the include/net/ip_vs.h people.
>
> 12.	I need to check with the BPF folks on the BPF verifier's
> 	definition of BTF_ID(func, rcu_read_unlock_strict).
>
> 13.	The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
> 	function might have some redundancy across the board instead
> 	of just on CONFIG_PREEMPT_RCU=y.  Or might not.

I don't think I understand any of these well enough to comment. Will
Cc the relevant folks when I send out the RFC.

> 14.	The kernel/trace/trace_osnoise.c file's run_osnoise() function
> 	might need to do something for non-preemptible RCU to make
> 	up for the lack of cond_resched() calls.  Maybe just drop the
> 	"IS_ENABLED()" and execute the body of the current "if" statement
> 	unconditionally.

Aah, yes this is a good idea. Thanks.

> 15.	I must defer to others on the mm/pgtable-generic.c file's
> 	#ifdef that depends on CONFIG_PREEMPT_RCU.
>
> While in the area, I noted that KLP seems to depend on cond_resched(),
> but on this I must defer to the KLP people.

Yeah, as part of this work, I ended up unhooking most of the KLP hooks
from cond_resched() and, of course, removing cond_resched() itself.
Will poke the livepatching people.

> I am sure that I am missing something, but I have not yet seen any
> show-stoppers.  Just some needed adjustments.

Appreciate this detailed list. Makes me think that everything might
not go up in smoke after all!

Thanks
Ankur

> Thoughts?
>
> 							Thanx, Paul
>
>> That's as much as I wanted to demonstrate and I'm not going to spend
>> more cycles on it as I have already too many other things in flight and
>> the resulting scheduler woes are clearly outside of my expertise.
>>
>> Though definitely I'm putting a permanent NAK in place for any attempts
>> to duct tape the preempt=NONE model any further by sprinkling more
>> cond*() and whatever warts around.
>>
>> Thanks,
>>
>>         tglx
>> ---
>>  arch/x86/Kconfig                   |    1
>>  arch/x86/include/asm/thread_info.h |    6 ++--
>>  drivers/acpi/processor_idle.c      |    2 -
>>  include/linux/entry-common.h       |    2 -
>>  include/linux/entry-kvm.h          |    2 -
>>  include/linux/sched.h              |   12 +++++---
>>  include/linux/sched/idle.h         |    8 ++---
>>  include/linux/thread_info.h        |   24 +++++++++++++++++
>>  include/linux/trace_events.h       |    8 ++---
>>  kernel/Kconfig.preempt             |   17 +++++++++++-
>>  kernel/entry/common.c              |    4 +-
>>  kernel/entry/kvm.c                 |    2 -
>>  kernel/sched/core.c                |   51 +++++++++++++++++++++++++------------
>>  kernel/sched/debug.c               |   19 +++++++++++++
>>  kernel/sched/fair.c                |   46 ++++++++++++++++++++++-----------
>>  kernel/sched/features.h            |    2 +
>>  kernel/sched/idle.c                |    3 --
>>  kernel/sched/sched.h               |    1
>>  kernel/trace/trace.c               |    2 +
>>  kernel/trace/trace_output.c        |   16 ++++++++++-
>>  20 files changed, 171 insertions(+), 57 deletions(-)
>>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct
>>
>>  #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
>>  /*
>> - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
>> + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
>>   * this avoids any races wrt polling state changes and thereby avoids
>>   * spurious IPIs.
>>   */
>> -static inline bool set_nr_and_not_polling(struct task_struct *p)
>> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
>>  {
>>  	struct thread_info *ti = task_thread_info(p);
>> -	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
>> +
>> +	return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG);
>>  }
>>
>>  /*
>> @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas
>>  	for (;;) {
>>  		if (!(val & _TIF_POLLING_NRFLAG))
>>  			return false;
>> -		if (val & _TIF_NEED_RESCHED)
>> +		if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>>  			return true;
>>  		if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
>>  			break;
>> @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas
>>  }
>>
>>  #else
>> -static inline bool set_nr_and_not_polling(struct task_struct *p)
>> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
>>  {
>> -	set_tsk_need_resched(p);
>> +	set_tsk_thread_flag(p, tif_bit);
>>  	return true;
>>  }
>>
>> @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head)
>>   * might also involve a cross-CPU call to trigger the scheduler on
>>   * the target CPU.
>>   */
>> -void resched_curr(struct rq *rq)
>> +static void __resched_curr(struct rq *rq, int lazy)
>>  {
>> +	int cpu, tif_bit = TIF_NEED_RESCHED + lazy;
>>  	struct task_struct *curr = rq->curr;
>> -	int cpu;
>>
>>  	lockdep_assert_rq_held(rq);
>>
>> -	if (test_tsk_need_resched(curr))
>> +	if (unlikely(test_tsk_thread_flag(curr, tif_bit)))
>>  		return;
>>
>>  	cpu = cpu_of(rq);
>>
>>  	if (cpu == smp_processor_id()) {
>> -		set_tsk_need_resched(curr);
>> -		set_preempt_need_resched();
>> +		set_tsk_thread_flag(curr, tif_bit);
>> +		if (!lazy)
>> +			set_preempt_need_resched();
>>  		return;
>>  	}
>>
>> -	if (set_nr_and_not_polling(curr))
>> -		smp_send_reschedule(cpu);
>> -	else
>> +	if (set_nr_and_not_polling(curr, tif_bit)) {
>> +		if (!lazy)
>> +			smp_send_reschedule(cpu);
>> +	} else {
>>  		trace_sched_wake_idle_without_ipi(cpu);
>> +	}
>> +}
>> +
>> +void resched_curr(struct rq *rq)
>> +{
>> +	__resched_curr(rq, 0);
>> +}
>> +
>> +void resched_curr_lazy(struct rq *rq)
>> +{
>> +	int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ?
>> +		TIF_NEED_RESCHED_LAZY_OFFSET : 0;
>> +
>> +	if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED)))
>> +		return;
>> +
>> +	__resched_curr(rq, lazy);
>>  }
>>
>>  void resched_cpu(int cpu)
>> @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu)
>>  	if (cpu == smp_processor_id())
>>  		return;
>>
>> -	if (set_nr_and_not_polling(rq->idle))
>> +	if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
>>  		smp_send_reschedule(cpu);
>>  	else
>>  		trace_sched_wake_idle_without_ipi(cpu);
>> @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init(
>>  		WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
>>  		return preempt_dynamic_mode == preempt_dynamic_##mode;		 \
>>  	}									 \
>> -	EXPORT_SYMBOL_GPL(preempt_model_##mode)
>>
>>  PREEMPT_MODEL_ACCESSOR(none);
>>  PREEMPT_MODEL_ACCESSOR(voluntary);
>> --- a/include/linux/thread_info.h
>> +++ b/include/linux/thread_info.h
>> @@ -59,6 +59,16 @@ enum syscall_work_bit {
>>
>>  #include <asm/thread_info.h>
>>
>> +#ifdef CONFIG_PREEMPT_AUTO
>> +# define TIF_NEED_RESCHED_LAZY		TIF_ARCH_RESCHED_LAZY
>> +# define _TIF_NEED_RESCHED_LAZY		_TIF_ARCH_RESCHED_LAZY
>> +# define TIF_NEED_RESCHED_LAZY_OFFSET	(TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
>> +#else
>> +# define TIF_NEED_RESCHED_LAZY		TIF_NEED_RESCHED
>> +# define _TIF_NEED_RESCHED_LAZY		_TIF_NEED_RESCHED
>> +# define TIF_NEED_RESCHED_LAZY_OFFSET	0
>> +#endif
>> +
>>  #ifdef __KERNEL__
>>
>>  #ifndef arch_set_restart_data
>> @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res
>>  			     (unsigned long *)(&current_thread_info()->flags));
>>  }
>>
>> +static __always_inline bool tif_need_resched_lazy(void)
>> +{
>> +	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
>> +		arch_test_bit(TIF_NEED_RESCHED_LAZY,
>> +			      (unsigned long *)(&current_thread_info()->flags));
>> +}
>> +
>>  #else
>>
>>  static __always_inline bool tif_need_resched(void)
>> @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res
>>  			(unsigned long *)(&current_thread_info()->flags));
>>  }
>>
>> +static __always_inline bool tif_need_resched_lazy(void)
>> +{
>> +	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
>> +		test_bit(TIF_NEED_RESCHED_LAZY,
>> +			 (unsigned long *)(&current_thread_info()->flags));
>> +}
>> +
>>  #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
>>
>>  #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
>> --- a/kernel/Kconfig.preempt
>> +++ b/kernel/Kconfig.preempt
>> @@ -11,6 +11,13 @@ config PREEMPT_BUILD
>>  	select PREEMPTION
>>  	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
>>
>> +config PREEMPT_BUILD_AUTO
>> +	bool
>> +	select PREEMPT_BUILD
>> +
>> +config HAVE_PREEMPT_AUTO
>> +	bool
>> +
>>  choice
>>  	prompt "Preemption Model"
>>  	default PREEMPT_NONE
>> @@ -67,9 +74,17 @@ config PREEMPT
>>  	  embedded system with latency requirements in the milliseconds
>>  	  range.
>>
>> +config PREEMPT_AUTO
>> +	bool "Automagic preemption mode with runtime tweaking support"
>> +	depends on HAVE_PREEMPT_AUTO
>> +	select PREEMPT_BUILD_AUTO
>> +	help
>> +	  Add some sensible blurb here
>> +
>>  config PREEMPT_RT
>>  	bool "Fully Preemptible Kernel (Real-Time)"
>>  	depends on EXPERT && ARCH_SUPPORTS_RT
>> +	select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
>>  	select PREEMPTION
>>  	help
>>  	  This option turns the kernel into a real-time kernel by replacing
>> @@ -95,7 +110,7 @@ config PREEMPTION
>>
>>  config PREEMPT_DYNAMIC
>>  	bool "Preemption behaviour defined on boot"
>> -	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
>> +	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
>>  	select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
>>  	select PREEMPT_BUILD
>>  	default y if HAVE_PREEMPT_DYNAMIC_CALL
>> --- a/include/linux/entry-common.h
>> +++ b/include/linux/entry-common.h
>> @@ -60,7 +60,7 @@
>>  #define EXIT_TO_USER_MODE_WORK						\
>>  	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
>>  	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
>> -	 ARCH_EXIT_TO_USER_MODE_WORK)
>> +	 _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
>>
>>  /**
>>   * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
>> --- a/include/linux/entry-kvm.h
>> +++ b/include/linux/entry-kvm.h
>> @@ -18,7 +18,7 @@
>>
>>  #define XFER_TO_GUEST_MODE_WORK						\
>>  	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
>> -	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
>> +	 _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
>>
>>  struct kvm_vcpu;
>>
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l
>>
>>  		local_irq_enable_exit_to_user(ti_work);
>>
>> -		if (ti_work & _TIF_NEED_RESCHED)
>> +		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>>  			schedule();
>>
>>  		if (ti_work & _TIF_UPROBE)
>> @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void
>>  		rcu_irq_exit_check_preempt();
>>  		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
>>  			WARN_ON_ONCE(!on_thread_stack());
>> -		if (need_resched())
>> +		if (test_tsk_need_resched(current))
>>  			preempt_schedule_irq();
>>  	}
>>  }
>> --- a/kernel/sched/features.h
>> +++ b/kernel/sched/features.h
>> @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
>>  SCHED_FEAT(LATENCY_WARN, false)
>>
>>  SCHED_FEAT(HZ_BW, true)
>> +
>> +SCHED_FEAT(FORCE_NEED_RESCHED, false)
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
>>  extern void reweight_task(struct task_struct *p, int prio);
>>
>>  extern void resched_curr(struct rq *rq);
>> +extern void resched_curr_lazy(struct rq *rq);
>>  extern void resched_cpu(int cpu);
>>
>>  extern struct rt_bandwidth def_rt_bandwidth;
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
>>  	update_ti_thread_flag(task_thread_info(tsk), flag, value);
>>  }
>>
>> -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
>> +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
>>  {
>>  	return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
>>  }
>>
>> -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
>> +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
>>  {
>>  	return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
>>  }
>>
>> -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
>> +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
>>  {
>>  	return test_ti_thread_flag(task_thread_info(tsk), flag);
>>  }
>> @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched(
>>  static inline void clear_tsk_need_resched(struct task_struct *tsk)
>>  {
>>  	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
>> +	if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
>> +		clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
>>  }
>>
>> -static inline int test_tsk_need_resched(struct task_struct *tsk)
>> +static inline bool test_tsk_need_resched(struct task_struct *tsk)
>>  {
>>  	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
>>  }
>> @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc
>>
>>  static __always_inline bool need_resched(void)
>>  {
>> -	return unlikely(tif_need_resched());
>> +	return unlikely(tif_need_resched_lazy() || tif_need_resched());
>>  }
>>
>>  /*
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq
>>   * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
>>   * this is probably good enough.
>>   */
>> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick)
>>  {
>> +	struct rq *rq = rq_of(cfs_rq);
>> +
>>  	if ((s64)(se->vruntime - se->deadline) < 0)
>>  		return;
>>
>> @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r
>>  	/*
>>  	 * The task has consumed its request, reschedule.
>>  	 */
>> -	if (cfs_rq->nr_running > 1) {
>> -		resched_curr(rq_of(cfs_rq));
>> -		clear_buddies(cfs_rq, se);
>> +	if (cfs_rq->nr_running < 2)
>> +		return;
>> +
>> +	if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
>> +		resched_curr(rq);
>> +	} else {
>> +		/* Did the task ignore the lazy reschedule request? */
>> +		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
>> +			resched_curr(rq);
>> +		else
>> +			resched_curr_lazy(rq);
>>  	}
>> +	clear_buddies(cfs_rq, se);
>>  }
>>
>>  #include "pelt.h"
>> @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf
>>  /*
>>   * Update the current task's runtime statistics.
>>   */
>> -static void update_curr(struct cfs_rq *cfs_rq)
>> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
>>  {
>>  	struct sched_entity *curr = cfs_rq->curr;
>>  	u64 now = rq_clock_task(rq_of(cfs_rq));
>> @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c
>>  	schedstat_add(cfs_rq->exec_clock, delta_exec);
>>
>>  	curr->vruntime += calc_delta_fair(delta_exec, curr);
>> -	update_deadline(cfs_rq, curr);
>> +	update_deadline(cfs_rq, curr, tick);
>>  	update_min_vruntime(cfs_rq);
>>
>>  	if (entity_is_task(curr)) {
>> @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c
>>  	account_cfs_rq_runtime(cfs_rq, delta_exec);
>>  }
>>
>> +static inline void update_curr(struct cfs_rq *cfs_rq)
>> +{
>> +	__update_curr(cfs_rq, false);
>> +}
>> +
>>  static void update_curr_fair(struct rq *rq)
>>  {
>>  	update_curr(cfs_rq_of(&rq->curr->se));
>> @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>>  	/*
>>  	 * Update run-time statistics of the 'current'.
>>  	 */
>> -	update_curr(cfs_rq);
>> +	__update_curr(cfs_rq, true);
>>
>>  	/*
>>  	 * Ensure that runnable average is periodically updated.
>> @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>>  	 * validating it and just reschedule.
>>  	 */
>>  	if (queued) {
>> -		resched_curr(rq_of(cfs_rq));
>> +		resched_curr_lazy(rq_of(cfs_rq));
>>  		return;
>>  	}
>>  	/*
>> @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str
>>  	 * hierarchy can be throttled
>>  	 */
>>  	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
>> -		resched_curr(rq_of(cfs_rq));
>> +		resched_curr_lazy(rq_of(cfs_rq));
>>  }
>>
>>  static __always_inline
>> @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
>>
>>  	/* Determine whether we need to wake up potentially idle CPU: */
>>  	if (rq->curr == rq->idle && rq->cfs.nr_running)
>> -		resched_curr(rq);
>> +		resched_curr_lazy(rq);
>>  }
>>
>>  #ifdef CONFIG_SMP
>> @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq
>>
>>  		if (delta < 0) {
>>  			if (task_current(rq, p))
>> -				resched_curr(rq);
>> +				resched_curr_lazy(rq);
>>  			return;
>>  		}
>>  		hrtick_start(rq, delta);
>> @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct
>>  	 * prevents us from potentially nominating it as a false LAST_BUDDY
>>  	 * below.
>>  	 */
>> -	if (test_tsk_need_resched(curr))
>> +	if (need_resched())
>>  		return;
>>
>>  	/* Idle tasks are by definition preempted by non-idle tasks. */
>> @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct
>>  	return;
>>
>>  preempt:
>> -	resched_curr(rq);
>> +	resched_curr_lazy(rq);
>>  }
>>
>>  #ifdef CONFIG_SMP
>> @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct
>>  	 */
>>  	if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 &&
>>  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
>> -		resched_curr(rq);
>> +		resched_curr_lazy(rq);
>>  }
>>
>>  /*
>> @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct
>>  	 */
>>  	if (task_current(rq, p)) {
>>  		if (p->prio > oldprio)
>> -			resched_curr(rq);
>> +			resched_curr_lazy(rq);
>>  	} else
>>  		check_preempt_curr(rq, p, 0);
>>  }
>> --- a/drivers/acpi/processor_idle.c
>> +++ b/drivers/acpi/processor_idle.c
>> @@ -108,7 +108,7 @@ static const struct dmi_system_id proces
>>   */
>>  static void __cpuidle acpi_safe_halt(void)
>>  {
>> -	if (!tif_need_resched()) {
>> +	if (!need_resched()) {
>>  		raw_safe_halt();
>>  		raw_local_irq_disable();
>>  	}
>> --- a/include/linux/sched/idle.h
>> +++ b/include/linux/sched/idle.h
>> @@ -63,7 +63,7 @@ static __always_inline bool __must_check
>>  	 */
>>  	smp_mb__after_atomic();
>>
>> -	return unlikely(tif_need_resched());
>> +	return unlikely(need_resched());
>>  }
>>
>>  static __always_inline bool __must_check current_clr_polling_and_test(void)
>> @@ -76,7 +76,7 @@ static __always_inline bool __must_check
>>  	 */
>>  	smp_mb__after_atomic();
>>
>> -	return unlikely(tif_need_resched());
>> +	return unlikely(need_resched());
>>  }
>>
>>  #else
>> @@ -85,11 +85,11 @@ static inline void __current_clr_polling
>>
>>  static inline bool __must_check current_set_polling_and_test(void)
>>  {
>> -	return unlikely(tif_need_resched());
>> +	return unlikely(need_resched());
>>  }
>>  static inline bool __must_check current_clr_polling_and_test(void)
>>  {
>> -	return unlikely(tif_need_resched());
>> +	return unlikely(need_resched());
>>  }
>>  #endif
>>
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
>>  	ct_cpuidle_enter();
>>
>>  	raw_local_irq_enable();
>> -	while (!tif_need_resched() &&
>> -	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
>> +	while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
>>  		cpu_relax();
>>  	raw_local_irq_disable();
>>
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>>
>>  	if (tif_need_resched())
>>  		trace_flags |= TRACE_FLAG_NEED_RESCHED;
>> +	if (tif_need_resched_lazy())
>> +		trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
>>  	if (test_preempt_need_resched())
>>  		trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
>>  	return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -271,6 +271,7 @@ config X86
>>  	select HAVE_STATIC_CALL
>>  	select HAVE_STATIC_CALL_INLINE		if HAVE_OBJTOOL
>>  	select HAVE_PREEMPT_DYNAMIC_CALL
>> +	select HAVE_PREEMPT_AUTO
>>  	select HAVE_RSEQ
>>  	select HAVE_RUST			if X86_64
>>  	select HAVE_SYSCALL_TRACEPOINTS
>> --- a/arch/x86/include/asm/thread_info.h
>> +++ b/arch/x86/include/asm/thread_info.h
>> @@ -81,8 +81,9 @@ struct thread_info {
>>  #define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
>>  #define TIF_SIGPENDING		2	/* signal pending */
>>  #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
>> -#define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
>> -#define TIF_SSBD		5	/* Speculative store bypass disable */
>> +#define TIF_ARCH_RESCHED_LAZY	4	/* Lazy rescheduling */
>> +#define TIF_SINGLESTEP		5	/* reenable singlestep on user return*/
>> +#define TIF_SSBD		6	/* Speculative store bypass disable */
>>  #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
>>  #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
>>  #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
>> @@ -104,6 +105,7 @@ struct thread_info {
>>  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
>>  #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
>>  #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
>> +#define _TIF_ARCH_RESCHED_LAZY	(1 << TIF_ARCH_RESCHED_LAZY)
>>  #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
>>  #define _TIF_SSBD		(1 << TIF_SSBD)
>>  #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
>> --- a/kernel/entry/kvm.c
>> +++ b/kernel/entry/kvm.c
>> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc
>>  			return -EINTR;
>>  		}
>>
>> -		if (ti_work & _TIF_NEED_RESCHED)
>> +		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>>  			schedule();
>>
>>  		if (ti_work & _TIF_NOTIFY_RESUME)
>> --- a/include/linux/trace_events.h
>> +++ b/include/linux/trace_events.h
>> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>>
>>  enum trace_flag_type {
>>  	TRACE_FLAG_IRQS_OFF		= 0x01,
>> -	TRACE_FLAG_IRQS_NOSUPPORT	= 0x02,
>> -	TRACE_FLAG_NEED_RESCHED		= 0x04,
>> +	TRACE_FLAG_NEED_RESCHED		= 0x02,
>> +	TRACE_FLAG_NEED_RESCHED_LAZY	= 0x04,
>>  	TRACE_FLAG_HARDIRQ		= 0x08,
>>  	TRACE_FLAG_SOFTIRQ		= 0x10,
>>  	TRACE_FLAG_PREEMPT_RESCHED	= 0x20,
>> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c
>>
>>  static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
>>  {
>> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
>> +	return tracing_gen_ctx_irq_test(0);
>>  }
>>  static inline unsigned int tracing_gen_ctx(void)
>>  {
>> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
>> +	return tracing_gen_ctx_irq_test(0);
>>  }
>>  #endif
>>
>> --- a/kernel/trace/trace_output.c
>> +++ b/kernel/trace/trace_output.c
>> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
>>  		(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
>>  		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
>>  		bh_off ? 'b' :
>> -		(entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
>> +		!IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
>>  		'.';
>>
>> -	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
>> +	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
>>  				TRACE_FLAG_PREEMPT_RESCHED)) {
>> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
>> +		need_resched = 'B';
>> +		break;
>>  	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
>>  		need_resched = 'N';
>>  		break;
>> +	case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
>> +		need_resched = 'L';
>> +		break;
>> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
>> +		need_resched = 'b';
>> +		break;
>>  	case TRACE_FLAG_NEED_RESCHED:
>>  		need_resched = 'n';
>>  		break;
>> +	case TRACE_FLAG_NEED_RESCHED_LAZY:
>> +		need_resched = 'l';
>> +		break;
>>  	case TRACE_FLAG_PREEMPT_RESCHED:
>>  		need_resched = 'p';
>>  		break;
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -333,6 +333,23 @@ static const struct file_operations sche
>>  	.release	= seq_release,
>>  };
>>
>> +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf,
>> +			       size_t cnt, loff_t *ppos)
>> +{
>> +	unsigned long end = jiffies + 60 * HZ;
>> +
>> +	for (; time_before(jiffies, end) && !signal_pending(current);)
>> +		cpu_relax();
>> +
>> +	return cnt;
>> +}
>> +
>> +static const struct file_operations sched_hog_fops = {
>> +	.write		= sched_hog_write,
>> +	.open		= simple_open,
>> +	.llseek		= default_llseek,
>> +};
>> +
>>  static struct dentry *debugfs_sched;
>>
>>  static __init int sched_init_debug(void)
>> @@ -374,6 +391,8 @@ static __init int sched_init_debug(void)
>>
>>  	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>>
>> +	debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops);
>> +
>>  	return 0;
>>  }
>>  late_initcall(sched_init_debug);
>>


--
ankur
Thomas Gleixner Oct. 18, 2023, 1:16 p.m. UTC | #101
Paul!

On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> Belatedly calling out some RCU issues.  Nothing fatal, just a
> (surprisingly) few adjustments that will need to be made.  The key thing
> to note is that from RCU's viewpoint, with this change, all kernels
> are preemptible, though rcu_read_lock() readers remain
> non-preemptible.

Why? Either I'm confused or you or both of us :)

With this approach the kernel is by definition fully preemptible, which
means rcu_read_lock() is preemptible too. That's pretty much the
same situation as with PREEMPT_DYNAMIC.

For throughput sake this fully preemptible kernel provides a mechanism
to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.

That means the preemption points in preempt_enable() and return from
interrupt to kernel will not see NEED_RESCHED and the tasks can run to
completion either to the point where they call schedule() or when they
return to user space. That's pretty much what PREEMPT_NONE does today.

The difference to NONE/VOLUNTARY is that the explicit cond_resched()
points are no longer required because the scheduler can preempt the
long running task by setting NEED_RESCHED instead.

That preemption might be suboptimal in some cases compared to
cond_resched(), but from my initial experimentation that's not really an
issue.
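
For reference, the tick side of that in the PoC (quoted elsewhere in
this thread) boils down to roughly the following; a condensed sketch of
the update_deadline() hunk, not the complete change:

	if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
		resched_curr(rq);
	} else {
		/* Did the task ignore the lazy reschedule request? */
		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
			resched_curr(rq);	/* escalate to a real NEED_RESCHED */
		else
			resched_curr_lazy(rq);	/* only set the lazy bit */
	}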

> With that:
>
> 1.	As an optimization, given that preempt_count() would always give
> 	good information, the scheduling-clock interrupt could sense RCU
> 	readers for new-age CONFIG_PREEMPT_NONE=y kernels.  As might the
> 	IPI handlers for expedited grace periods.  A nice optimization.
> 	Except that...
>
> 2.	The quiescent-state-forcing code currently relies on the presence
> 	of cond_resched() in CONFIG_PREEMPT_RCU=n kernels.  One fix
> 	would be to do resched_cpu() more quickly, but some workloads
> 	might not love the additional IPIs.  Another approach to do #1
> 	above to replace the quiescent states from cond_resched() with
> 	scheduler-tick-interrupt-sensed quiescent states.

Right. The tick can see either the lazy resched bit "ignored" or some
magic "RCU needs a quiescent state" and force a reschedule.

> 	Plus...
>
> 3.	For nohz_full CPUs that run for a long time in the kernel,
> 	there are no scheduling-clock interrupts.  RCU reaches for
> 	the resched_cpu() hammer a few jiffies into the grace period.
> 	And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> 	interrupt-entry code will re-enable its scheduling-clock interrupt
> 	upon receiving the resched_cpu() IPI.

You can spare the IPI by setting NEED_RESCHED on the remote CPU which
will cause it to preempt.

> 	So nohz_full CPUs should be OK as far as RCU is concerned.
> 	Other subsystems might have other opinions.
>
> 4.	As another optimization, kvfree_rcu() could unconditionally
> 	check preempt_count() to sense a clean environment suitable for
> 	memory allocation.

Correct. All the limitations of preempt count being useless are gone.

> 5.	Kconfig files with "select TASKS_RCU if PREEMPTION" must
> 	instead say "select TASKS_RCU".  This means that the #else
> 	in include/linux/rcupdate.h that defines TASKS_RCU in terms of
> 	vanilla RCU must go.  There might be some fallout if something
> 	fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
> 	and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
> 	rcu_tasks_classic_qs() to do something useful.

In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
remaining would be CONFIG_PREEMPT_RT, which should be renamed to
CONFIG_RT or such as it does not really change the preemption
model itself. RT just reduces the preemption disabled sections with the
lock conversions, forced interrupt threading and some more.

> 6.	You might think that RCU Tasks (as opposed to RCU Tasks Trace
> 	or RCU Tasks Rude) would need those pesky cond_resched() calls
> 	to stick around.  The reason is that RCU Tasks readers are ended
> 	only by voluntary context switches.  This means that although a
> 	preemptible infinite loop in the kernel won't inconvenience a
> 	real-time task (nor a non-real-time task for all that long),
> 	and won't delay grace periods for the other flavors of RCU,
> 	it would indefinitely delay an RCU Tasks grace period.
>
> 	However, RCU Tasks grace periods seem to be finite in preemptible
> 	kernels today, so they should remain finite in limited-preemptible
> 	kernels tomorrow.  Famous last words...

That's an issue which you have today with preempt FULL, right? So if it
turns out to be a problem then it's not a problem of the new model.

> 7.	RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
> 	any algorithmic difference from this change.
>
> 8.	As has been noted elsewhere, in this new limited-preemption
> 	mode of operation, rcu_read_lock() readers remain preemptible.
> 	This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.

Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no?

> 9.	The rcu_preempt_depth() macro could do something useful in
> 	limited-preemption kernels.  Its current lack of ability in
> 	CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.

Correct.

> 10.	The cond_resched_rcu() function must remain because we still
> 	have non-preemptible rcu_read_lock() readers.

Where?

> 11.	My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
> 	unchanged, but I must defer to the include/net/ip_vs.h people.

*blink*

> 12.	I need to check with the BPF folks on the BPF verifier's
> 	definition of BTF_ID(func, rcu_read_unlock_strict).
>
> 13.	The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
> 	function might have some redundancy across the board instead
> 	of just on CONFIG_PREEMPT_RCU=y.  Or might not.
>
> 14.	The kernel/trace/trace_osnoise.c file's run_osnoise() function
> 	might need to do something for non-preemptible RCU to make
> 	up for the lack of cond_resched() calls.  Maybe just drop the
> 	"IS_ENABLED()" and execute the body of the current "if" statement
> 	unconditionally.

Again. There is no non-preemptible RCU with this model, unless I'm
missing something important here.

> 15.	I must defer to others on the mm/pgtable-generic.c file's
> 	#ifdef that depends on CONFIG_PREEMPT_RCU.

All those ifdefs should die :)

> While in the area, I noted that KLP seems to depend on cond_resched(),
> but on this I must defer to the KLP people.

Yeah, KLP needs some thoughts, but that's not rocket science to fix IMO.

> I am sure that I am missing something, but I have not yet seen any
> show-stoppers.  Just some needed adjustments.

Right. If it works out as I think it can work out the main adjustments
are to remove a large amount of #ifdef maze and related gunk :)

Thanks,

        tglx
Steven Rostedt Oct. 18, 2023, 2:31 p.m. UTC | #102
On Wed, 18 Oct 2023 15:16:12 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> > 14.	The kernel/trace/trace_osnoise.c file's run_osnoise() function
> > 	might need to do something for non-preemptible RCU to make
> > 	up for the lack of cond_resched() calls.  Maybe just drop the
> > 	"IS_ENABLED()" and execute the body of the current "if" statement
> > 	unconditionally.  

Right.

I'm guessing you are talking about this code:

                /*
                 * In some cases, notably when running on a nohz_full CPU with
                 * a stopped tick PREEMPT_RCU has no way to account for QSs.
                 * This will eventually cause unwarranted noise as PREEMPT_RCU
                 * will force preemption as the means of ending the current
                 * grace period. We avoid this problem by calling
                 * rcu_momentary_dyntick_idle(), which performs a zero duration
                 * EQS allowing PREEMPT_RCU to end the current grace period.
                 * This call shouldn't be wrapped inside an RCU critical
                 * section.
                 *
                 * Note that in non PREEMPT_RCU kernels QSs are handled through
                 * cond_resched()
                 */
                if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
                        if (!disable_irq)
                                local_irq_disable();

                        rcu_momentary_dyntick_idle();

                        if (!disable_irq)
                                local_irq_enable();
                }

                /*
                 * For the non-preemptive kernel config: let threads runs, if
                 * they so wish, unless set not do to so.
                 */
                if (!disable_irq && !disable_preemption)
                        cond_resched();



If everything becomes PREEMPT_RCU, then the above should be able to be
turned into just:

                if (!disable_irq)
                        local_irq_disable();

                rcu_momentary_dyntick_idle();

                if (!disable_irq)
                        local_irq_enable();

And no cond_resched() is needed.

> 
> Again. There is no non-preemptible RCU with this model, unless I'm
> missing something important here.

Daniel?

-- Steve
Paul E. McKenney Oct. 18, 2023, 5:19 p.m. UTC | #103
On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> Paul!
> 
> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> > Belatedly calling out some RCU issues.  Nothing fatal, just a
> > (surprisingly) few adjustments that will need to be made.  The key thing
> > to note is that from RCU's viewpoint, with this change, all kernels
> > are preemptible, though rcu_read_lock() readers remain
> > non-preemptible.
> 
> Why? Either I'm confused or you or both of us :)

Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
as preempt_enable() in this approach?  I certainly hope so, as RCU
priority boosting would be a most unwelcome addition to many datacenter
workloads.

> With this approach the kernel is by definition fully preemptible, which
> means rcu_read_lock() is preemptible too. That's pretty much the
> same situation as with PREEMPT_DYNAMIC.

Please, just no!!!

Please note that the current use of PREEMPT_DYNAMIC with preempt=none
avoids preempting RCU read-side critical sections.  This means that the
distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption
of RCU readers in environments expecting no preemption.

> For throughput sake this fully preemptible kernel provides a mechanism
> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
> 
> That means the preemption points in preempt_enable() and return from
> interrupt to kernel will not see NEED_RESCHED and the tasks can run to
> completion either to the point where they call schedule() or when they
> return to user space. That's pretty much what PREEMPT_NONE does today.
> 
> The difference to NONE/VOLUNTARY is that the explicit cond_resched()
> points are no longer required because the scheduler can preempt the
> long running task by setting NEED_RESCHED instead.
> 
> That preemption might be suboptimal in some cases compared to
> cond_resched(), but from my initial experimentation that's not really an
> issue.

I am not (repeat NOT) arguing for keeping cond_resched().  I am instead
arguing that the less-preemptible variants of the kernel should continue
to avoid preempting RCU read-side critical sections.

> > With that:
> >
> > 1.	As an optimization, given that preempt_count() would always give
> > 	good information, the scheduling-clock interrupt could sense RCU
> > 	readers for new-age CONFIG_PREEMPT_NONE=y kernels.  As might the
> > 	IPI handlers for expedited grace periods.  A nice optimization.
> > 	Except that...
> >
> > 2.	The quiescent-state-forcing code currently relies on the presence
> > 	of cond_resched() in CONFIG_PREEMPT_RCU=n kernels.  One fix
> > 	would be to do resched_cpu() more quickly, but some workloads
> > 	might not love the additional IPIs.  Another approach to do #1
> > 	above to replace the quiescent states from cond_resched() with
> > 	scheduler-tick-interrupt-sensed quiescent states.
> 
> Right. The tick can see either the lazy resched bit "ignored" or some
> magic "RCU needs a quiescent state" and force a reschedule.

Good, thank you for confirming.

> > 	Plus...
> >
> > 3.	For nohz_full CPUs that run for a long time in the kernel,
> > 	there are no scheduling-clock interrupts.  RCU reaches for
> > 	the resched_cpu() hammer a few jiffies into the grace period.
> > 	And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> > 	interrupt-entry code will re-enable its scheduling-clock interrupt
> > 	upon receiving the resched_cpu() IPI.
> 
> You can spare the IPI by setting NEED_RESCHED on the remote CPU which
> will cause it to preempt.

That is not sufficient for nohz_full CPUs executing in userspace, which
won't see that NEED_RESCHED until they either take an interrupt or do
a system call.  And applications often work hard to prevent nohz_full
CPUs from doing either.

Please note that if the holdout CPU really is a nohz_full CPU executing
in userspace, RCU will see this courtesy of context tracking and will
therefore avoid ever IPIing it.  The IPIs only happen if a nohz_full
CPU ends up executing for a long time in the kernel, which is an error
condition for the nohz_full use cases that I am aware of.

> > 	So nohz_full CPUs should be OK as far as RCU is concerned.
> > 	Other subsystems might have other opinions.
> >
> > 4.	As another optimization, kvfree_rcu() could unconditionally
> > 	check preempt_count() to sense a clean environment suitable for
> > 	memory allocation.
> 
> Correct. All the limitations of preempt count being useless are gone.

Woo-hoo!!!  And that is of course a very attractive property of this.

> > 5.	Kconfig files with "select TASKS_RCU if PREEMPTION" must
> > 	instead say "select TASKS_RCU".  This means that the #else
> > 	in include/linux/rcupdate.h that defines TASKS_RCU in terms of
> > 	vanilla RCU must go.  There might be some fallout if something
> > 	fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
> > 	and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
> > 	rcu_tasks_classic_qs() to do something useful.
> 
> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> CONFIG_RT or such as it does not really change the preemption
> model itself. RT just reduces the preemption disabled sections with the
> lock conversions, forced interrupt threading and some more.

Again, please, no.

There are situations where we still need rcu_read_lock() and
rcu_read_unlock() to be preempt_disable() and preempt_enable(),
repectively.  Those can be cases selected only by Kconfig option, not
available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.

> > 6.	You might think that RCU Tasks (as opposed to RCU Tasks Trace
> > 	or RCU Tasks Rude) would need those pesky cond_resched() calls
> > 	to stick around.  The reason is that RCU Tasks readers are ended
> > 	only by voluntary context switches.  This means that although a
> > 	preemptible infinite loop in the kernel won't inconvenience a
> > 	real-time task (nor a non-real-time task for all that long),
> > 	and won't delay grace periods for the other flavors of RCU,
> > 	it would indefinitely delay an RCU Tasks grace period.
> >
> > 	However, RCU Tasks grace periods seem to be finite in preemptible
> > 	kernels today, so they should remain finite in limited-preemptible
> > 	kernels tomorrow.  Famous last words...
> 
> That's an issue which you have today with preempt FULL, right? So if it
> turns out to be a problem then it's not a problem of the new model.

Agreed, and hence my last three lines of text above.  Plus the guy who
requested RCU Tasks said that it was OK for its grace periods to take
a long time, and I am holding Steven Rostedt to that.  ;-)

> > 7.	RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
> > 	any algorithmic difference from this change.
> >
> > 8.	As has been noted elsewhere, in this new limited-preemption
> > 	mode of operation, rcu_read_lock() readers remain preemptible.
> > 	This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
> 
> Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no?

That is in fact the problem.  Preemption can be good, but it is possible
to have too much of a good thing, and preemptible RCU read-side critical
sections definitely are in that category for some important workloads.  ;-)

> > 9.	The rcu_preempt_depth() macro could do something useful in
> > 	limited-preemption kernels.  Its current lack of ability in
> > 	CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.
> 
> Correct.
> 
> > 10.	The cond_resched_rcu() function must remain because we still
> > 	have non-preemptible rcu_read_lock() readers.
> 
> Where?

In datacenters.

> > 11.	My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
> > 	unchanged, but I must defer to the include/net/ip_vs.h people.
> 
> *blink*

No argument here.  ;-)

> > 12.	I need to check with the BPF folks on the BPF verifier's
> > 	definition of BTF_ID(func, rcu_read_unlock_strict).
> >
> > 13.	The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
> > 	function might have some redundancy across the board instead
> > 	of just on CONFIG_PREEMPT_RCU=y.  Or might not.
> >
> > 14.	The kernel/trace/trace_osnoise.c file's run_osnoise() function
> > 	might need to do something for non-preemptible RCU to make
> > 	up for the lack of cond_resched() calls.  Maybe just drop the
> > 	"IS_ENABLED()" and execute the body of the current "if" statement
> > 	unconditionally.
> 
> Again. There is no non-preemptible RCU with this model, unless I'm
> missing something important here.

And again, there needs to be non-preemptible RCU with this model.

> > 15.	I must defer to others on the mm/pgtable-generic.c file's
> > 	#ifdef that depends on CONFIG_PREEMPT_RCU.
> 
> All those ifdefs should die :)

Like all things, they will eventually.  ;-)

> > While in the area, I noted that KLP seems to depend on cond_resched(),
> > but on this I must defer to the KLP people.
> 
> Yeah, KLP needs some thoughts, but that's not rocket science to fix IMO.

Not rocket science, just KLP science, which I am happy to defer to the
KLP people.

> > I am sure that I am missing something, but I have not yet seen any
> > show-stoppers.  Just some needed adjustments.
> 
> Right. If it works out as I think it can work out the main adjustments
> are to remove a large amount of #ifdef maze and related gunk :)

Just please don't remove the #ifdef gunk that is still needed!

							Thanx, Paul
Steven Rostedt Oct. 18, 2023, 5:41 p.m. UTC | #104
On Wed, 18 Oct 2023 10:19:53 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> as preempt_enable() in this approach?  I certainly hope so, as RCU
> priority boosting would be a most unwelcome addition to many datacenter
> workloads.
> 
> > With this approach the kernel is by definition fully preemptible, which
> > means rcu_read_lock() is preemptible too. That's pretty much the
> > same situation as with PREEMPT_DYNAMIC.  
> 
> Please, just no!!!

Note, when I first read Thomas's proposal, I figured that Paul would no
longer get to brag that:

 "In CONFIG_PREEMPT_NONE, rcu_read_lock() and rcu_read_unlock() are simply
 nops!"

But instead, they would be:

static void rcu_read_lock(void)
{
	preempt_disable();
}

static void rcu_read_unlock(void)
{
	preempt_enable();
}

as it was mentioned that today's preempt_disable() is fast and not an issue
like it was in older kernels.

That would mean that there will still be a "non preempt" version of RCU.

The preempt version of RCU adds a lot more logic when scheduling out in
an RCU critical section, which I can envision not all workloads would
want around. Adding "preempt_disable()" is now low overhead, but adding
the RCU logic to handle preemption isn't as lightweight as that.

Not to mention the logic to boost those threads that were preempted and
have been starved for some time.



> > > 6.	You might think that RCU Tasks (as opposed to RCU Tasks Trace
> > > 	or RCU Tasks Rude) would need those pesky cond_resched() calls
> > > 	to stick around.  The reason is that RCU Tasks readers are ended
> > > 	only by voluntary context switches.  This means that although a
> > > 	preemptible infinite loop in the kernel won't inconvenience a
> > > 	real-time task (nor a non-real-time task for all that long),
> > > 	and won't delay grace periods for the other flavors of RCU,
> > > 	it would indefinitely delay an RCU Tasks grace period.
> > >
> > > 	However, RCU Tasks grace periods seem to be finite in preemptible
> > > 	kernels today, so they should remain finite in limited-preemptible
> > > 	kernels tomorrow.  Famous last words...  
> > 
> > That's an issue which you have today with preempt FULL, right? So if it
> > turns out to be a problem then it's not a problem of the new model.  
> 
> Agreed, and hence my last three lines of text above.  Plus the guy who
> requested RCU Tasks said that it was OK for its grace periods to take
> a long time, and I am holding Steven Rostedt to that.  ;-)

Matters what your definition of "long time" is ;-)

-- Steve
Paul E. McKenney Oct. 18, 2023, 5:51 p.m. UTC | #105
On Wed, Oct 18, 2023 at 05:09:46AM -0700, Ankur Arora wrote:
> 
> Paul E. McKenney <paulmck@kernel.org> writes:
> 
> > On Sat, Sep 23, 2023 at 03:11:05AM +0200, Thomas Gleixner wrote:
> >> On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
> >> > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
> >> >> That said - I think as a proof of concept and "look, with this we get
> >> >> the expected scheduling event counts", that patch is perfect. I think
> >> >> you more than proved the concept.
> >> >
> >> > There is certainly quite some analyis work to do to make this a one to
> >> > one replacement.
> >> >
> >> > With a handful of benchmarks the PoC (tweaked with some obvious fixes)
> >> > is pretty much on par with the current mainline variants (NONE/FULL),
> >> > but the memtier benchmark makes a massive dent.
> >> >
> >> > It sports a whopping 10% regression with the LAZY mode versus the mainline
> >> > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
> >> >
> >> > That benchmark is really sensitive to the preemption model. With current
> >> > mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20%
> >> > performance drop versus preempt=NONE.
> >>
> >> That 20% was a tired pilot error. The real number is in the 5% ballpark.
> >>
> >> > I have no clue what's going on there yet, but that shows that there is
> >> > obviously quite some work ahead to get this sorted.
> >>
> >> It took some head scratching to figure that out. The initial fix broke
> >> the handling of the hog issue, i.e. the problem that Ankur tried to
> >> solve, but I hacked up a "solution" for that too.
> >>
> >> With that the memtier benchmark is roughly back to the mainline numbers,
> >> but my throughput benchmark know how is pretty close to zero, so that
> >> should be looked at by people who actually understand these things.
> >>
> >> Likewise the hog prevention is just at the PoC level and clearly beyond
> >> my knowledge of scheduler details: It unconditionally forces a
> >> reschedule when the looping task is not responding to a lazy reschedule
> >> request before the next tick. IOW it forces a reschedule on the second
> >> tick, which is obviously different from the cond_resched()/might_sleep()
> >> behaviour.
> >>
> >> The changes vs. the original PoC aside of the bug and thinko fixes:
> >>
> >>     1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
> >>        lazy preempt bit as the trace_entry::flags field is full already.
> >>
> >>        That obviously breaks the tracer ABI, but if we go there then
> >>        this needs to be fixed. Steven?
> >>
> >>     2) debugfs file to validate that loops can be force preempted w/o
> >>        cond_resched()
> >>
> >>        The usage is:
> >>
> >>        # taskset -c 1 bash
> >>        # echo 1 > /sys/kernel/debug/sched/hog &
> >>        # echo 1 > /sys/kernel/debug/sched/hog &
> >>        # echo 1 > /sys/kernel/debug/sched/hog &
> >>
> >>        top shows ~33% CPU for each of the hogs and tracing confirms that
> >>        the crude hack in the scheduler tick works:
> >>
> >>             bash-4559    [001] dlh2.  2253.331202: resched_curr <-__update_curr
> >>             bash-4560    [001] dlh2.  2253.340199: resched_curr <-__update_curr
> >>             bash-4561    [001] dlh2.  2253.346199: resched_curr <-__update_curr
> >>             bash-4559    [001] dlh2.  2253.353199: resched_curr <-__update_curr
> >>             bash-4561    [001] dlh2.  2253.358199: resched_curr <-__update_curr
> >>             bash-4560    [001] dlh2.  2253.370202: resched_curr <-__update_curr
> >>             bash-4559    [001] dlh2.  2253.378198: resched_curr <-__update_curr
> >>             bash-4561    [001] dlh2.  2253.389199: resched_curr <-__update_curr
> >>
> >>        The 'l' instead of the usual 'N' reflects that the lazy resched
> >>        bit is set. That makes __update_curr() invoke resched_curr()
> >>        instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
> >>        and folds it into preempt_count so that preemption happens at the
> >>        next possible point, i.e. either in return from interrupt or at
> >>        the next preempt_enable().
> >
> > Belatedly calling out some RCU issues.  Nothing fatal, just a
> > (surprisingly) few adjustments that will need to be made.  The key thing
> > to note is that from RCU's viewpoint, with this change, all kernels
> > are preemptible, though rcu_read_lock() readers remain non-preemptible.
> 
> Yeah, in Thomas' patch CONFIG_PREEMPTION=y and preemption models
> none/voluntary/full are just scheduler tweaks on top of that. And, so
> this would always have PREEMPT_RCU=y. So, shouldn't rcu_read_lock()
> readers be preemptible?
> 
> (An alternate configuration might be:
>    config PREEMPT_NONE
>       select PREEMPT_COUNT
> 
>     config PREEMPT_FULL
>       select PREEMPTION
> 
>  This probably allows for more configuration flexibility across archs?
>  Would allow for TREE_RCU=y, for instance. That said, so far I've only
>  been working with PREEMPT_RCU=y.)

Then this is a bug that needs to be fixed.  We need a way to make
RCU readers non-preemptible.

> > With that:
> >
> > 1.	As an optimization, given that preempt_count() would always give
> > 	good information, the scheduling-clock interrupt could sense RCU
> > 	readers for new-age CONFIG_PREEMPT_NONE=y kernels.  As might the
> > 	IPI handlers for expedited grace periods.  A nice optimization.
> > 	Except that...
> >
> > 2.	The quiescent-state-forcing code currently relies on the presence
> > 	of cond_resched() in CONFIG_PREEMPT_RCU=n kernels.  One fix
> > 	would be to do resched_cpu() more quickly, but some workloads
> > 	might not love the additional IPIs.  Another approach to do #1
> > 	above to replace the quiescent states from cond_resched() with
> > 	scheduler-tick-interrupt-sensed quiescent states.
> 
> Right, the call to rcu_all_qs(). Just to see if I have it straight,
> something like this for PREEMPT_RCU=n kernels?
> 
>    if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0)
>            rcu_all_qs();
> 
> (Masked because PREEMPT_NONE might not do any folding for
> NEED_RESCHED_LAZY in the tick.)
> 
> Though the comment around rcu_all_qs() mentions that rcu_all_qs()
> reports a quiescent state only if urgently needed. Given that the tick
> executes less frequently than calls to cond_resched(), could we just
> always report instead? Or am I completely on the wrong track?
> 
>    if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0) {
>              preempt_disable();
>              rcu_qs();
>              preempt_enable();
>    }
> 
> On your point about the preempt_count() being dependable, there's a
> wrinkle. As Linus mentions in
> https://lore.kernel.org/lkml/CAHk-=wgUimqtF7PqFfRw4Ju5H1KYkp6+8F=hBz7amGQ8GaGKkA@mail.gmail.com/,
> that might not be true for architectures that define ARCH_NO_PREEMPT.
> 
> My plan was to limit those archs to do preemption only at user space boundary
> but there are almost certainly RCU implications that I missed.

Just add this to the "if" condition of the CONFIG_PREEMPT_RCU=n version
of rcu_flavor_sched_clock_irq():

	|| !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))

Resulting in something like this:

------------------------------------------------------------------------

static void rcu_flavor_sched_clock_irq(int user)
{
	if (user || rcu_is_cpu_rrupt_from_idle() ||
	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {

		/*
		 * Get here if this CPU took its interrupt from user
		 * mode or from the idle loop, and if this is not a nested
		 * interrupt, or if the interrupt is from a preemptible
		 * region of the kernel.  In this case, the CPU is in a
		 * quiescent state, so note it.
		 *
		 * No memory barrier is required here because rcu_qs()
		 * references only CPU-local variables that other CPUs
		 * neither access nor modify, at least not while the
		 * corresponding CPU is online.
		 */
		rcu_qs();
	}
}

------------------------------------------------------------------------

> > 	Plus...
> >
> > 3.	For nohz_full CPUs that run for a long time in the kernel,
> > 	there are no scheduling-clock interrupts.  RCU reaches for
> > 	the resched_cpu() hammer a few jiffies into the grace period.
> > 	And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> > 	interrupt-entry code will re-enable its scheduling-clock interrupt
> > 	upon receiving the resched_cpu() IPI.
> >
> > 	So nohz_full CPUs should be OK as far as RCU is concerned.
> > 	Other subsystems might have other opinions.
> 
> Ah, that's what I thought from my reading of the RCU comments. Good to
> have that confirmed. Thanks.
> 
> > 4.	As another optimization, kvfree_rcu() could unconditionally
> > 	check preempt_count() to sense a clean environment suitable for
> > 	memory allocation.
> 
> Had missed this completely. Could you elaborate?

It is just an optimization.  But the idea is to use less restrictive
GFP_ flags in add_ptr_to_bulk_krc_lock() when the caller's context
allows it.  Add Uladzislau on CC for his thoughts.
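
Something along these lines, purely as a sketch; the preemptible()
check and the exact GFP flags are my guess at what "less restrictive"
would look like, not a worked-out patch:

	/*
	 * With preempt_count() always meaningful, use a blocking GFP
	 * mask when the caller is in a clean preemptible context and
	 * keep the current non-blocking behaviour otherwise.
	 */
	gfp_t gfp = preemptible() ? GFP_KERNEL | __GFP_NOWARN
				  : GFP_NOWAIT | __GFP_NOWARN;

and then feed that gfp to the existing allocation in that path.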

> > 5.	Kconfig files with "select TASKS_RCU if PREEMPTION" must
> > 	instead say "select TASKS_RCU".  This means that the #else
> > 	in include/linux/rcupdate.h that defines TASKS_RCU in terms of
> > 	vanilla RCU must go.  There might be some fallout if something
> > 	fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
> > 	and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
> > 	rcu_tasks_classic_qs() to do something useful.
> 
> Ack.
> 
> > 6.	You might think that RCU Tasks (as opposed to RCU Tasks Trace
> > 	or RCU Tasks Rude) would need those pesky cond_resched() calls
> > 	to stick around.  The reason is that RCU Tasks readers are ended
> > 	only by voluntary context switches.  This means that although a
> > 	preemptible infinite loop in the kernel won't inconvenience a
> > 	real-time task (nor a non-real-time task for all that long),
> > 	and won't delay grace periods for the other flavors of RCU,
> > 	it would indefinitely delay an RCU Tasks grace period.
> >
> > 	However, RCU Tasks grace periods seem to be finite in preemptible
> > 	kernels today, so they should remain finite in limited-preemptible
> > 	kernels tomorrow.  Famous last words...
> >
> > 7.	RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
> > 	any algorithmic difference from this change.
> 
> So, essentially, as long as RCU tasks eventually, in the fullness of
> time, call schedule(), removing cond_resched() shouldn't have any
> effect :).

Almost.

SRCU and RCU Tasks Trace have explicit read-side state changes that
the corresponding grace-period code can detect, one way or another,
and thus are not dependent on reschedules.  RCU Tasks Rude does explicit
reschedules on all CPUs (hence "Rude"), and thus doesn't have to care
about whether or not other things do reschedules.

> > 8.	As has been noted elsewhere, in this new limited-preemption
> > 	mode of operation, rcu_read_lock() readers remain preemptible.
> > 	This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
> 
> Ack.
> 
> > 9.	The rcu_preempt_depth() macro could do something useful in
> > 	limited-preemption kernels.  Its current lack of ability in
> > 	CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.
> >
> > 10.	The cond_resched_rcu() function must remain because we still
> > 	have non-preemptible rcu_read_lock() readers.
> 
> For configurations with PREEMPT_RCU=n? Yes, agreed. Though it need
> only be this, right?:
> 
>    static inline void cond_resched_rcu(void)
>    {
>    #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
>            rcu_read_unlock();
> 
>            rcu_read_lock();
>    #endif
>    }

There is a good chance that it will also need to do an explicit
rcu_all_qs().  The problem is that there is an extremely low probability
that the scheduling clock interrupt will hit that space between the
rcu_read_unlock() and rcu_read_lock().

But either way, not a showstopper.
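
As a sketch of what I mean, building on your version above and assuming
rcu_all_qs() remains usable from this context the way it is from
today's __cond_resched():

static inline void cond_resched_rcu(void)
{
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
	rcu_read_unlock();
#ifndef CONFIG_PREEMPT_RCU
	/*
	 * Don't count on the scheduling-clock interrupt landing in this
	 * tiny window; report the quiescent state explicitly.
	 */
	rcu_all_qs();
#endif
	rcu_read_lock();
#endif
}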

> > 11.	My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
> > 	unchanged, but I must defer to the include/net/ip_vs.h people.
> >
> > 12.	I need to check with the BPF folks on the BPF verifier's
> > 	definition of BTF_ID(func, rcu_read_unlock_strict).
> >
> > 13.	The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
> > 	function might have some redundancy across the board instead
> > 	of just on CONFIG_PREEMPT_RCU=y.  Or might not.
> 
> I don't think I understand any of these well enough to comment. Will
> Cc the relevant folks when I send out the RFC.

Sounds like a plan to me!  ;-)

> > 14.	The kernel/trace/trace_osnoise.c file's run_osnoise() function
> > 	might need to do something for non-preemptible RCU to make
> > 	up for the lack of cond_resched() calls.  Maybe just drop the
> > 	"IS_ENABLED()" and execute the body of the current "if" statement
> > 	unconditionally.
> 
> Aah, yes this is a good idea. Thanks.
> 
> > 15.	I must defer to others on the mm/pgtable-generic.c file's
> > 	#ifdef that depends on CONFIG_PREEMPT_RCU.
> >
> > While in the area, I noted that KLP seems to depend on cond_resched(),
> > but on this I must defer to the KLP people.
> 
> Yeah, as part of this work, I ended up unhooking most of the KLP
> hooks in cond_resched() and of course, cond_resched() itself.
> Will poke the livepatching people.

Again, sounds like a plan to me!

> > I am sure that I am missing something, but I have not yet seen any
> > show-stoppers.  Just some needed adjustments.
> 
> Appreciate this detailed list. Makes me think that everything might
> not go up in smoke after all!

C'mon, Ankur, if it doesn't go up in smoke at some point, you just
aren't trying hard enough!  ;-)

							Thanx, Paul

> Thanks
> Ankur
> 
> > Thoughts?
> >
> > 							Thanx, Paul
> >
> >> That's as much as I wanted to demonstrate and I'm not going to spend
> >> more cycles on it as I have already too many other things on flight and
> >> the resulting scheduler woes are clearly outside of my expertice.
> >>
> >> Though definitely I'm putting a permanent NAK in place for any attempts
> >> to duct tape the preempt=NONE model any further by sprinkling more
> >> cond*() and whatever warts around.
> >>
> >> Thanks,
> >>
> >>         tglx
> >> ---
> >>  arch/x86/Kconfig                   |    1
> >>  arch/x86/include/asm/thread_info.h |    6 ++--
> >>  drivers/acpi/processor_idle.c      |    2 -
> >>  include/linux/entry-common.h       |    2 -
> >>  include/linux/entry-kvm.h          |    2 -
> >>  include/linux/sched.h              |   12 +++++---
> >>  include/linux/sched/idle.h         |    8 ++---
> >>  include/linux/thread_info.h        |   24 +++++++++++++++++
> >>  include/linux/trace_events.h       |    8 ++---
> >>  kernel/Kconfig.preempt             |   17 +++++++++++-
> >>  kernel/entry/common.c              |    4 +-
> >>  kernel/entry/kvm.c                 |    2 -
> >>  kernel/sched/core.c                |   51 +++++++++++++++++++++++++------------
> >>  kernel/sched/debug.c               |   19 +++++++++++++
> >>  kernel/sched/fair.c                |   46 ++++++++++++++++++++++-----------
> >>  kernel/sched/features.h            |    2 +
> >>  kernel/sched/idle.c                |    3 --
> >>  kernel/sched/sched.h               |    1
> >>  kernel/trace/trace.c               |    2 +
> >>  kernel/trace/trace_output.c        |   16 ++++++++++-
> >>  20 files changed, 171 insertions(+), 57 deletions(-)
> >>
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct
> >>
> >>  #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
> >>  /*
> >> - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
> >> + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
> >>   * this avoids any races wrt polling state changes and thereby avoids
> >>   * spurious IPIs.
> >>   */
> >> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> >> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
> >>  {
> >>  	struct thread_info *ti = task_thread_info(p);
> >> -	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
> >> +
> >> +	return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG);
> >>  }
> >>
> >>  /*
> >> @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas
> >>  	for (;;) {
> >>  		if (!(val & _TIF_POLLING_NRFLAG))
> >>  			return false;
> >> -		if (val & _TIF_NEED_RESCHED)
> >> +		if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> >>  			return true;
> >>  		if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
> >>  			break;
> >> @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas
> >>  }
> >>
> >>  #else
> >> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> >> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
> >>  {
> >> -	set_tsk_need_resched(p);
> >> +	set_tsk_thread_flag(p, tif_bit);
> >>  	return true;
> >>  }
> >>
> >> @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head)
> >>   * might also involve a cross-CPU call to trigger the scheduler on
> >>   * the target CPU.
> >>   */
> >> -void resched_curr(struct rq *rq)
> >> +static void __resched_curr(struct rq *rq, int lazy)
> >>  {
> >> +	int cpu, tif_bit = TIF_NEED_RESCHED + lazy;
> >>  	struct task_struct *curr = rq->curr;
> >> -	int cpu;
> >>
> >>  	lockdep_assert_rq_held(rq);
> >>
> >> -	if (test_tsk_need_resched(curr))
> >> +	if (unlikely(test_tsk_thread_flag(curr, tif_bit)))
> >>  		return;
> >>
> >>  	cpu = cpu_of(rq);
> >>
> >>  	if (cpu == smp_processor_id()) {
> >> -		set_tsk_need_resched(curr);
> >> -		set_preempt_need_resched();
> >> +		set_tsk_thread_flag(curr, tif_bit);
> >> +		if (!lazy)
> >> +			set_preempt_need_resched();
> >>  		return;
> >>  	}
> >>
> >> -	if (set_nr_and_not_polling(curr))
> >> -		smp_send_reschedule(cpu);
> >> -	else
> >> +	if (set_nr_and_not_polling(curr, tif_bit)) {
> >> +		if (!lazy)
> >> +			smp_send_reschedule(cpu);
> >> +	} else {
> >>  		trace_sched_wake_idle_without_ipi(cpu);
> >> +	}
> >> +}
> >> +
> >> +void resched_curr(struct rq *rq)
> >> +{
> >> +	__resched_curr(rq, 0);
> >> +}
> >> +
> >> +void resched_curr_lazy(struct rq *rq)
> >> +{
> >> +	int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ?
> >> +		TIF_NEED_RESCHED_LAZY_OFFSET : 0;
> >> +
> >> +	if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED)))
> >> +		return;
> >> +
> >> +	__resched_curr(rq, lazy);
> >>  }
> >>
> >>  void resched_cpu(int cpu)
> >> @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu)
> >>  	if (cpu == smp_processor_id())
> >>  		return;
> >>
> >> -	if (set_nr_and_not_polling(rq->idle))
> >> +	if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
> >>  		smp_send_reschedule(cpu);
> >>  	else
> >>  		trace_sched_wake_idle_without_ipi(cpu);
> >> @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init(
> >>  		WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
> >>  		return preempt_dynamic_mode == preempt_dynamic_##mode;		 \
> >>  	}									 \
> >> -	EXPORT_SYMBOL_GPL(preempt_model_##mode)
> >>
> >>  PREEMPT_MODEL_ACCESSOR(none);
> >>  PREEMPT_MODEL_ACCESSOR(voluntary);
> >> --- a/include/linux/thread_info.h
> >> +++ b/include/linux/thread_info.h
> >> @@ -59,6 +59,16 @@ enum syscall_work_bit {
> >>
> >>  #include <asm/thread_info.h>
> >>
> >> +#ifdef CONFIG_PREEMPT_AUTO
> >> +# define TIF_NEED_RESCHED_LAZY		TIF_ARCH_RESCHED_LAZY
> >> +# define _TIF_NEED_RESCHED_LAZY		_TIF_ARCH_RESCHED_LAZY
> >> +# define TIF_NEED_RESCHED_LAZY_OFFSET	(TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
> >> +#else
> >> +# define TIF_NEED_RESCHED_LAZY		TIF_NEED_RESCHED
> >> +# define _TIF_NEED_RESCHED_LAZY		_TIF_NEED_RESCHED
> >> +# define TIF_NEED_RESCHED_LAZY_OFFSET	0
> >> +#endif
> >> +
> >>  #ifdef __KERNEL__
> >>
> >>  #ifndef arch_set_restart_data
> >> @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res
> >>  			     (unsigned long *)(&current_thread_info()->flags));
> >>  }
> >>
> >> +static __always_inline bool tif_need_resched_lazy(void)
> >> +{
> >> +	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
> >> +		arch_test_bit(TIF_NEED_RESCHED_LAZY,
> >> +			      (unsigned long *)(&current_thread_info()->flags));
> >> +}
> >> +
> >>  #else
> >>
> >>  static __always_inline bool tif_need_resched(void)
> >> @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res
> >>  			(unsigned long *)(&current_thread_info()->flags));
> >>  }
> >>
> >> +static __always_inline bool tif_need_resched_lazy(void)
> >> +{
> >> +	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
> >> +		test_bit(TIF_NEED_RESCHED_LAZY,
> >> +			 (unsigned long *)(&current_thread_info()->flags));
> >> +}
> >> +
> >>  #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
> >>
> >>  #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
> >> --- a/kernel/Kconfig.preempt
> >> +++ b/kernel/Kconfig.preempt
> >> @@ -11,6 +11,13 @@ config PREEMPT_BUILD
> >>  	select PREEMPTION
> >>  	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
> >>
> >> +config PREEMPT_BUILD_AUTO
> >> +	bool
> >> +	select PREEMPT_BUILD
> >> +
> >> +config HAVE_PREEMPT_AUTO
> >> +	bool
> >> +
> >>  choice
> >>  	prompt "Preemption Model"
> >>  	default PREEMPT_NONE
> >> @@ -67,9 +74,17 @@ config PREEMPT
> >>  	  embedded system with latency requirements in the milliseconds
> >>  	  range.
> >>
> >> +config PREEMPT_AUTO
> >> +	bool "Automagic preemption mode with runtime tweaking support"
> >> +	depends on HAVE_PREEMPT_AUTO
> >> +	select PREEMPT_BUILD_AUTO
> >> +	help
> >> +	  Add some sensible blurb here
> >> +
> >>  config PREEMPT_RT
> >>  	bool "Fully Preemptible Kernel (Real-Time)"
> >>  	depends on EXPERT && ARCH_SUPPORTS_RT
> >> +	select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
> >>  	select PREEMPTION
> >>  	help
> >>  	  This option turns the kernel into a real-time kernel by replacing
> >> @@ -95,7 +110,7 @@ config PREEMPTION
> >>
> >>  config PREEMPT_DYNAMIC
> >>  	bool "Preemption behaviour defined on boot"
> >> -	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
> >> +	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
> >>  	select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
> >>  	select PREEMPT_BUILD
> >>  	default y if HAVE_PREEMPT_DYNAMIC_CALL
> >> --- a/include/linux/entry-common.h
> >> +++ b/include/linux/entry-common.h
> >> @@ -60,7 +60,7 @@
> >>  #define EXIT_TO_USER_MODE_WORK						\
> >>  	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
> >>  	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
> >> -	 ARCH_EXIT_TO_USER_MODE_WORK)
> >> +	 _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
> >>
> >>  /**
> >>   * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
> >> --- a/include/linux/entry-kvm.h
> >> +++ b/include/linux/entry-kvm.h
> >> @@ -18,7 +18,7 @@
> >>
> >>  #define XFER_TO_GUEST_MODE_WORK						\
> >>  	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
> >> -	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> >> +	 _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
> >>
> >>  struct kvm_vcpu;
> >>
> >> --- a/kernel/entry/common.c
> >> +++ b/kernel/entry/common.c
> >> @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l
> >>
> >>  		local_irq_enable_exit_to_user(ti_work);
> >>
> >> -		if (ti_work & _TIF_NEED_RESCHED)
> >> +		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> >>  			schedule();
> >>
> >>  		if (ti_work & _TIF_UPROBE)
> >> @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void
> >>  		rcu_irq_exit_check_preempt();
> >>  		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> >>  			WARN_ON_ONCE(!on_thread_stack());
> >> -		if (need_resched())
> >> +		if (test_tsk_need_resched(current))
> >>  			preempt_schedule_irq();
> >>  	}
> >>  }
> >> --- a/kernel/sched/features.h
> >> +++ b/kernel/sched/features.h
> >> @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
> >>  SCHED_FEAT(LATENCY_WARN, false)
> >>
> >>  SCHED_FEAT(HZ_BW, true)
> >> +
> >> +SCHED_FEAT(FORCE_NEED_RESCHED, false)
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
> >>  extern void reweight_task(struct task_struct *p, int prio);
> >>
> >>  extern void resched_curr(struct rq *rq);
> >> +extern void resched_curr_lazy(struct rq *rq);
> >>  extern void resched_cpu(int cpu);
> >>
> >>  extern struct rt_bandwidth def_rt_bandwidth;
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
> >>  	update_ti_thread_flag(task_thread_info(tsk), flag, value);
> >>  }
> >>
> >> -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
> >> +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
> >>  {
> >>  	return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
> >>  }
> >>
> >> -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
> >> +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
> >>  {
> >>  	return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
> >>  }
> >>
> >> -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
> >> +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
> >>  {
> >>  	return test_ti_thread_flag(task_thread_info(tsk), flag);
> >>  }
> >> @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched(
> >>  static inline void clear_tsk_need_resched(struct task_struct *tsk)
> >>  {
> >>  	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
> >> +	if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
> >> +		clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
> >>  }
> >>
> >> -static inline int test_tsk_need_resched(struct task_struct *tsk)
> >> +static inline bool test_tsk_need_resched(struct task_struct *tsk)
> >>  {
> >>  	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
> >>  }
> >> @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc
> >>
> >>  static __always_inline bool need_resched(void)
> >>  {
> >> -	return unlikely(tif_need_resched());
> >> +	return unlikely(tif_need_resched_lazy() || tif_need_resched());
> >>  }
> >>
> >>  /*
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq
> >>   * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
> >>   * this is probably good enough.
> >>   */
> >> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >> +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick)
> >>  {
> >> +	struct rq *rq = rq_of(cfs_rq);
> >> +
> >>  	if ((s64)(se->vruntime - se->deadline) < 0)
> >>  		return;
> >>
> >> @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r
> >>  	/*
> >>  	 * The task has consumed its request, reschedule.
> >>  	 */
> >> -	if (cfs_rq->nr_running > 1) {
> >> -		resched_curr(rq_of(cfs_rq));
> >> -		clear_buddies(cfs_rq, se);
> >> +	if (cfs_rq->nr_running < 2)
> >> +		return;
> >> +
> >> +	if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
> >> +		resched_curr(rq);
> >> +	} else {
> >> +		/* Did the task ignore the lazy reschedule request? */
> >> +		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
> >> +			resched_curr(rq);
> >> +		else
> >> +			resched_curr_lazy(rq);
> >>  	}
> >> +	clear_buddies(cfs_rq, se);
> >>  }
> >>
> >>  #include "pelt.h"
> >> @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf
> >>  /*
> >>   * Update the current task's runtime statistics.
> >>   */
> >> -static void update_curr(struct cfs_rq *cfs_rq)
> >> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
> >>  {
> >>  	struct sched_entity *curr = cfs_rq->curr;
> >>  	u64 now = rq_clock_task(rq_of(cfs_rq));
> >> @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c
> >>  	schedstat_add(cfs_rq->exec_clock, delta_exec);
> >>
> >>  	curr->vruntime += calc_delta_fair(delta_exec, curr);
> >> -	update_deadline(cfs_rq, curr);
> >> +	update_deadline(cfs_rq, curr, tick);
> >>  	update_min_vruntime(cfs_rq);
> >>
> >>  	if (entity_is_task(curr)) {
> >> @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c
> >>  	account_cfs_rq_runtime(cfs_rq, delta_exec);
> >>  }
> >>
> >> +static inline void update_curr(struct cfs_rq *cfs_rq)
> >> +{
> >> +	__update_curr(cfs_rq, false);
> >> +}
> >> +
> >>  static void update_curr_fair(struct rq *rq)
> >>  {
> >>  	update_curr(cfs_rq_of(&rq->curr->se));
> >> @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
> >>  	/*
> >>  	 * Update run-time statistics of the 'current'.
> >>  	 */
> >> -	update_curr(cfs_rq);
> >> +	__update_curr(cfs_rq, true);
> >>
> >>  	/*
> >>  	 * Ensure that runnable average is periodically updated.
> >> @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
> >>  	 * validating it and just reschedule.
> >>  	 */
> >>  	if (queued) {
> >> -		resched_curr(rq_of(cfs_rq));
> >> +		resched_curr_lazy(rq_of(cfs_rq));
> >>  		return;
> >>  	}
> >>  	/*
> >> @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str
> >>  	 * hierarchy can be throttled
> >>  	 */
> >>  	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
> >> -		resched_curr(rq_of(cfs_rq));
> >> +		resched_curr_lazy(rq_of(cfs_rq));
> >>  }
> >>
> >>  static __always_inline
> >> @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
> >>
> >>  	/* Determine whether we need to wake up potentially idle CPU: */
> >>  	if (rq->curr == rq->idle && rq->cfs.nr_running)
> >> -		resched_curr(rq);
> >> +		resched_curr_lazy(rq);
> >>  }
> >>
> >>  #ifdef CONFIG_SMP
> >> @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq
> >>
> >>  		if (delta < 0) {
> >>  			if (task_current(rq, p))
> >> -				resched_curr(rq);
> >> +				resched_curr_lazy(rq);
> >>  			return;
> >>  		}
> >>  		hrtick_start(rq, delta);
> >> @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct
> >>  	 * prevents us from potentially nominating it as a false LAST_BUDDY
> >>  	 * below.
> >>  	 */
> >> -	if (test_tsk_need_resched(curr))
> >> +	if (need_resched())
> >>  		return;
> >>
> >>  	/* Idle tasks are by definition preempted by non-idle tasks. */
> >> @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct
> >>  	return;
> >>
> >>  preempt:
> >> -	resched_curr(rq);
> >> +	resched_curr_lazy(rq);
> >>  }
> >>
> >>  #ifdef CONFIG_SMP
> >> @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct
> >>  	 */
> >>  	if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 &&
> >>  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
> >> -		resched_curr(rq);
> >> +		resched_curr_lazy(rq);
> >>  }
> >>
> >>  /*
> >> @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct
> >>  	 */
> >>  	if (task_current(rq, p)) {
> >>  		if (p->prio > oldprio)
> >> -			resched_curr(rq);
> >> +			resched_curr_lazy(rq);
> >>  	} else
> >>  		check_preempt_curr(rq, p, 0);
> >>  }
> >> --- a/drivers/acpi/processor_idle.c
> >> +++ b/drivers/acpi/processor_idle.c
> >> @@ -108,7 +108,7 @@ static const struct dmi_system_id proces
> >>   */
> >>  static void __cpuidle acpi_safe_halt(void)
> >>  {
> >> -	if (!tif_need_resched()) {
> >> +	if (!need_resched()) {
> >>  		raw_safe_halt();
> >>  		raw_local_irq_disable();
> >>  	}
> >> --- a/include/linux/sched/idle.h
> >> +++ b/include/linux/sched/idle.h
> >> @@ -63,7 +63,7 @@ static __always_inline bool __must_check
> >>  	 */
> >>  	smp_mb__after_atomic();
> >>
> >> -	return unlikely(tif_need_resched());
> >> +	return unlikely(need_resched());
> >>  }
> >>
> >>  static __always_inline bool __must_check current_clr_polling_and_test(void)
> >> @@ -76,7 +76,7 @@ static __always_inline bool __must_check
> >>  	 */
> >>  	smp_mb__after_atomic();
> >>
> >> -	return unlikely(tif_need_resched());
> >> +	return unlikely(need_resched());
> >>  }
> >>
> >>  #else
> >> @@ -85,11 +85,11 @@ static inline void __current_clr_polling
> >>
> >>  static inline bool __must_check current_set_polling_and_test(void)
> >>  {
> >> -	return unlikely(tif_need_resched());
> >> +	return unlikely(need_resched());
> >>  }
> >>  static inline bool __must_check current_clr_polling_and_test(void)
> >>  {
> >> -	return unlikely(tif_need_resched());
> >> +	return unlikely(need_resched());
> >>  }
> >>  #endif
> >>
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
> >>  	ct_cpuidle_enter();
> >>
> >>  	raw_local_irq_enable();
> >> -	while (!tif_need_resched() &&
> >> -	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
> >> +	while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
> >>  		cpu_relax();
> >>  	raw_local_irq_disable();
> >>
> >> --- a/kernel/trace/trace.c
> >> +++ b/kernel/trace/trace.c
> >> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un
> >>
> >>  	if (tif_need_resched())
> >>  		trace_flags |= TRACE_FLAG_NEED_RESCHED;
> >> +	if (tif_need_resched_lazy())
> >> +		trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
> >>  	if (test_preempt_need_resched())
> >>  		trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
> >>  	return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
> >> --- a/arch/x86/Kconfig
> >> +++ b/arch/x86/Kconfig
> >> @@ -271,6 +271,7 @@ config X86
> >>  	select HAVE_STATIC_CALL
> >>  	select HAVE_STATIC_CALL_INLINE		if HAVE_OBJTOOL
> >>  	select HAVE_PREEMPT_DYNAMIC_CALL
> >> +	select HAVE_PREEMPT_AUTO
> >>  	select HAVE_RSEQ
> >>  	select HAVE_RUST			if X86_64
> >>  	select HAVE_SYSCALL_TRACEPOINTS
> >> --- a/arch/x86/include/asm/thread_info.h
> >> +++ b/arch/x86/include/asm/thread_info.h
> >> @@ -81,8 +81,9 @@ struct thread_info {
> >>  #define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
> >>  #define TIF_SIGPENDING		2	/* signal pending */
> >>  #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
> >> -#define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
> >> -#define TIF_SSBD		5	/* Speculative store bypass disable */
> >> +#define TIF_ARCH_RESCHED_LAZY	4	/* Lazy rescheduling */
> >> +#define TIF_SINGLESTEP		5	/* reenable singlestep on user return*/
> >> +#define TIF_SSBD		6	/* Speculative store bypass disable */
> >>  #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
> >>  #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
> >>  #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
> >> @@ -104,6 +105,7 @@ struct thread_info {
> >>  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
> >>  #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
> >>  #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
> >> +#define _TIF_ARCH_RESCHED_LAZY	(1 << TIF_ARCH_RESCHED_LAZY)
> >>  #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
> >>  #define _TIF_SSBD		(1 << TIF_SSBD)
> >>  #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
> >> --- a/kernel/entry/kvm.c
> >> +++ b/kernel/entry/kvm.c
> >> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc
> >>  			return -EINTR;
> >>  		}
> >>
> >> -		if (ti_work & _TIF_NEED_RESCHED)
> >> +		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> >>  			schedule();
> >>
> >>  		if (ti_work & _TIF_NOTIFY_RESUME)
> >> --- a/include/linux/trace_events.h
> >> +++ b/include/linux/trace_events.h
> >> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un
> >>
> >>  enum trace_flag_type {
> >>  	TRACE_FLAG_IRQS_OFF		= 0x01,
> >> -	TRACE_FLAG_IRQS_NOSUPPORT	= 0x02,
> >> -	TRACE_FLAG_NEED_RESCHED		= 0x04,
> >> +	TRACE_FLAG_NEED_RESCHED		= 0x02,
> >> +	TRACE_FLAG_NEED_RESCHED_LAZY	= 0x04,
> >>  	TRACE_FLAG_HARDIRQ		= 0x08,
> >>  	TRACE_FLAG_SOFTIRQ		= 0x10,
> >>  	TRACE_FLAG_PREEMPT_RESCHED	= 0x20,
> >> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c
> >>
> >>  static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
> >>  {
> >> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> >> +	return tracing_gen_ctx_irq_test(0);
> >>  }
> >>  static inline unsigned int tracing_gen_ctx(void)
> >>  {
> >> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
> >> +	return tracing_gen_ctx_irq_test(0);
> >>  }
> >>  #endif
> >>
> >> --- a/kernel/trace/trace_output.c
> >> +++ b/kernel/trace/trace_output.c
> >> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
> >>  		(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
> >>  		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
> >>  		bh_off ? 'b' :
> >> -		(entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
> >> +		!IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
> >>  		'.';
> >>
> >> -	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
> >> +	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
> >>  				TRACE_FLAG_PREEMPT_RESCHED)) {
> >> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> >> +		need_resched = 'B';
> >> +		break;
> >>  	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
> >>  		need_resched = 'N';
> >>  		break;
> >> +	case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
> >> +		need_resched = 'L';
> >> +		break;
> >> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
> >> +		need_resched = 'b';
> >> +		break;
> >>  	case TRACE_FLAG_NEED_RESCHED:
> >>  		need_resched = 'n';
> >>  		break;
> >> +	case TRACE_FLAG_NEED_RESCHED_LAZY:
> >> +		need_resched = 'l';
> >> +		break;
> >>  	case TRACE_FLAG_PREEMPT_RESCHED:
> >>  		need_resched = 'p';
> >>  		break;
> >> --- a/kernel/sched/debug.c
> >> +++ b/kernel/sched/debug.c
> >> @@ -333,6 +333,23 @@ static const struct file_operations sche
> >>  	.release	= seq_release,
> >>  };
> >>
> >> +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf,
> >> +			       size_t cnt, loff_t *ppos)
> >> +{
> >> +	unsigned long end = jiffies + 60 * HZ;
> >> +
> >> +	for (; time_before(jiffies, end) && !signal_pending(current);)
> >> +		cpu_relax();
> >> +
> >> +	return cnt;
> >> +}
> >> +
> >> +static const struct file_operations sched_hog_fops = {
> >> +	.write		= sched_hog_write,
> >> +	.open		= simple_open,
> >> +	.llseek		= default_llseek,
> >> +};
> >> +
> >>  static struct dentry *debugfs_sched;
> >>
> >>  static __init int sched_init_debug(void)
> >> @@ -374,6 +391,8 @@ static __init int sched_init_debug(void)
> >>
> >>  	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
> >>
> >> +	debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops);
> >> +
> >>  	return 0;
> >>  }
> >>  late_initcall(sched_init_debug);
> >>
> 
> 
> --
> ankur
Paul E. McKenney Oct. 18, 2023, 5:55 p.m. UTC | #106
On Wed, Oct 18, 2023 at 10:31:46AM -0400, Steven Rostedt wrote:
> On Wed, 18 Oct 2023 15:16:12 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > > 14.	The kernel/trace/trace_osnoise.c file's run_osnoise() function
> > > 	might need to do something for non-preemptible RCU to make
> > > 	up for the lack of cond_resched() calls.  Maybe just drop the
> > > 	"IS_ENABLED()" and execute the body of the current "if" statement
> > > 	unconditionally.  
> 
> Right.
> 
> I'm guessing you are talking about this code:
> 
>                 /*
>                  * In some cases, notably when running on a nohz_full CPU with
>                  * a stopped tick PREEMPT_RCU has no way to account for QSs.
>                  * This will eventually cause unwarranted noise as PREEMPT_RCU
>                  * will force preemption as the means of ending the current
>                  * grace period. We avoid this problem by calling
>                  * rcu_momentary_dyntick_idle(), which performs a zero duration
>                  * EQS allowing PREEMPT_RCU to end the current grace period.
>                  * This call shouldn't be wrapped inside an RCU critical
>                  * section.
>                  *
>                  * Note that in non PREEMPT_RCU kernels QSs are handled through
>                  * cond_resched()
>                  */
>                 if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
>                         if (!disable_irq)
>                                 local_irq_disable();
> 
>                         rcu_momentary_dyntick_idle();
> 
>                         if (!disable_irq)
>                                 local_irq_enable();
>                 }

That is indeed the place!

>                 /*
>                  * For the non-preemptive kernel config: let threads runs, if
>                  * they so wish, unless set not do to so.
>                  */
>                 if (!disable_irq && !disable_preemption)
>                         cond_resched();
> 
> 
> 
> If everything becomes PREEMPT_RCU, then the above should be able to be
> turned into just:
> 
>                 if (!disable_irq)
>                         local_irq_disable();
> 
>                 rcu_momentary_dyntick_idle();
> 
>                 if (!disable_irq)
>                         local_irq_enable();
> 
> And no cond_resched() is needed.

Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
run_osnoise() is running in kthread context with preemption and everything
else enabled (am I right?), then the change you suggest should work fine.

> > Again. There is no non-preemtible RCU with this model, unless I'm
> > missing something important here.
> 
> Daniel?

But very happy to defer to Daniel.  ;-)

							Thanx, Paul
Paul E. McKenney Oct. 18, 2023, 5:59 p.m. UTC | #107
On Wed, Oct 18, 2023 at 01:41:07PM -0400, Steven Rostedt wrote:
> On Wed, 18 Oct 2023 10:19:53 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > 
> > Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> > as preempt_enable() in this approach?  I certainly hope so, as RCU
> > priority boosting would be a most unwelcome addition to many datacenter
> > workloads.
> > 
> > > With this approach the kernel is by definition fully preemptible, which
> > > means means rcu_read_lock() is preemptible too. That's pretty much the
> > > same situation as with PREEMPT_DYNAMIC.  
> > 
> > Please, just no!!!
> 
> Note, when I first read Thomas's proposal, I figured that Paul would no
> longer get to brag that:
> 
>  "In CONFIG_PREEMPT_NONE, rcu_read_lock() and rcu_read_unlock() are simply
>  nops!"

I will still be able to brag that in a fully non-preemptible environment,
rcu_read_lock() and rcu_read_unlock() are simply no-ops.  It will
just be that the Linux kernel will no longer be such an environment.
For the moment, anyway, there is still userspace RCU along with a few
other instances of zero-cost RCU readers.  ;-)

> But instead, they would be:
> 
> static void rcu_read_lock(void)
> {
> 	preempt_disable();
> }
> 
> static void rcu_read_unlock(void)
> {
> 	preempt_enable();
> }
> 
> as it was mentioned that today's preempt_disable() is fast and not an issue
> like it was in older kernels.

And they are already defined as you show above in rcupdate.h, albeit
with leading underscores on the function names.

> That would mean that there will still be a "non preempt" version of RCU.

That would be very good!

> As the preempt version of RCU adds a lot more logic when scheduling out in
> an RCU critical section, that I can envision not all workloads would want
> around. Adding "preempt_disable()" is now low overhead, but adding the RCU
> logic to handle preemption isn't as lightweight as that.
> 
> Not to mention the logic to boost those threads that were preempted and
> being starved for some time.

Exactly, thank you!

> > > > 6.	You might think that RCU Tasks (as opposed to RCU Tasks Trace
> > > > 	or RCU Tasks Rude) would need those pesky cond_resched() calls
> > > > 	to stick around.  The reason is that RCU Tasks readers are ended
> > > > 	only by voluntary context switches.  This means that although a
> > > > 	preemptible infinite loop in the kernel won't inconvenience a
> > > > 	real-time task (nor an non-real-time task for all that long),
> > > > 	and won't delay grace periods for the other flavors of RCU,
> > > > 	it would indefinitely delay an RCU Tasks grace period.
> > > >
> > > > 	However, RCU Tasks grace periods seem to be finite in preemptible
> > > > 	kernels today, so they should remain finite in limited-preemptible
> > > > 	kernels tomorrow.  Famous last words...  
> > > 
> > > That's an issue which you have today with preempt FULL, right? So if it
> > > turns out to be a problem then it's not a problem of the new model.  
> > 
> > Agreed, and hence my last three lines of text above.  Plus the guy who
> > requested RCU Tasks said that it was OK for its grace periods to take
> > a long time, and I am holding Steven Rostedt to that.  ;-)
> 
> Matters what your definition of "long time" is ;-)

If RCU Tasks grace-period latency has been acceptable in preemptible
kernels (including all CONFIG_PREEMPT_DYNAMIC=y kernels), your definition
of "long" is sufficiently short.  ;-)

							Thanx, Paul
Steven Rostedt Oct. 18, 2023, 6 p.m. UTC | #108
On Wed, 18 Oct 2023 10:55:02 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > If everything becomes PREEMPT_RCU, then the above should be able to be
> > turned into just:
> > 
> >                 if (!disable_irq)
> >                         local_irq_disable();
> > 
> >                 rcu_momentary_dyntick_idle();
> > 
> >                 if (!disable_irq)
> >                         local_irq_enable();
> > 
> > And no cond_resched() is needed.  
> 
> Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
> run_osnoise() is running in kthread context with preemption and everything
> else enabled (am I right?), then the change you suggest should work fine.

There's a user space option that lets you run that loop with preemption and/or
interrupts disabled.

> 
> > > Again. There is no non-preemtible RCU with this model, unless I'm
> > > missing something important here.  
> > 
> > Daniel?  
> 
> But very happy to defer to Daniel.  ;-)

But Daniel could also correct me ;-)

-- Steve
Paul E. McKenney Oct. 18, 2023, 6:13 p.m. UTC | #109
On Wed, Oct 18, 2023 at 02:00:35PM -0400, Steven Rostedt wrote:
> On Wed, 18 Oct 2023 10:55:02 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > > If everything becomes PREEMPT_RCU, then the above should be able to be
> > > turned into just:
> > > 
> > >                 if (!disable_irq)
> > >                         local_irq_disable();
> > > 
> > >                 rcu_momentary_dyntick_idle();
> > > 
> > >                 if (!disable_irq)
> > >                         local_irq_enable();
> > > 
> > > And no cond_resched() is needed.  
> > 
> > Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
> > run_osnoise() is running in kthread context with preemption and everything
> > else enabled (am I right?), then the change you suggest should work fine.
> 
> There's a user space option that lets you run that loop with preemption and/or
> interrupts disabled.

Ah, thank you.  Then as long as this function is not expecting an RCU
reader to span that call to rcu_momentary_dyntick_idle(), all is well.
This is a kthread, so there cannot be something else expecting an RCU
reader to span that call.

> > > > Again. There is no non-preemtible RCU with this model, unless I'm
> > > > missing something important here.  
> > > 
> > > Daniel?  
> > 
> > But very happy to defer to Daniel.  ;-)
> 
> But Daniel could also correct me ;-)

If he figures out a way that it is broken, he gets to fix it.  ;-)

						Thanx, Paul
Ankur Arora Oct. 18, 2023, 8:15 p.m. UTC | #110
Paul E. McKenney <paulmck@kernel.org> writes:

> On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
>> Paul!
>>
>> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
>> > Belatedly calling out some RCU issues.  Nothing fatal, just a
>> > (surprisingly) few adjustments that will need to be made.  The key thing
>> > to note is that from RCU's viewpoint, with this change, all kernels
>> > are preemptible, though rcu_read_lock() readers remain
>> > non-preemptible.
>>
>> Why? Either I'm confused or you or both of us :)
>
> Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> as preempt_enable() in this approach?  I certainly hope so, as RCU
> priority boosting would be a most unwelcome addition to many datacenter
> workloads.

No, in this approach, PREEMPT_AUTO selects PREEMPTION and thus
PREEMPT_RCU, so rcu_read_lock/unlock() would touch
rcu_read_lock_nesting, which is identical to what PREEMPT_DYNAMIC does.
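
(For concreteness, a rough sketch of that preemptible reader fast path --
simplified from kernel/rcu/tree_plugin.h; lockdep checks, the negative
nesting handling, and most of the unlock slow path elided:

void __rcu_read_lock(void)
{
	current->rcu_read_lock_nesting++;	/* just track reader nesting */
	barrier();				/* critical section stays after the increment */
}

void __rcu_read_unlock(void)
{
	barrier();				/* critical section stays before the decrement */
	if (--current->rcu_read_lock_nesting == 0 &&
	    unlikely(READ_ONCE(current->rcu_read_unlock_special.s)))
		rcu_read_unlock_special(current);	/* deferred QS, unboost, etc. */
}

versus the plain preempt_disable()/preempt_enable() mapping with
PREEMPT_RCU=n.)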

>> With this approach the kernel is by definition fully preemptible, which
>> means means rcu_read_lock() is preemptible too. That's pretty much the
>> same situation as with PREEMPT_DYNAMIC.
>
> Please, just no!!!
>
> Please note that the current use of PREEMPT_DYNAMIC with preempt=none
> avoids preempting RCU read-side critical sections.  This means that the
> distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption
> of RCU readers in environments expecting no preemption.

Ah. So, though PREEMPT_DYNAMIC with preempt=none runs with PREEMPT_RCU,
preempt=none stubs out the actual preemption via __preempt_schedule.

Okay, I see what you are saying.

(Side issue: but this means that even for PREEMPT_DYNAMIC preempt=none,
_cond_resched() doesn't call rcu_all_qs().)
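
(That falls out of the way __cond_resched() is written today -- roughly,
and leaving out the PREEMPT_DYNAMIC static-call wrapping:

int __sched __cond_resched(void)
{
	if (should_resched(0)) {
		preempt_schedule_common();
		return 1;
	}
#ifndef CONFIG_PREEMPT_RCU
	rcu_all_qs();		/* only the !PREEMPT_RCU build reports a QS here */
#endif
	return 0;
}

so once PREEMPT_DYNAMIC forces PREEMPT_RCU=y, that rcu_all_qs() branch is
compiled out regardless of preempt=none/voluntary/full.)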

>> For throughput sake this fully preemptible kernel provides a mechanism
>> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
>> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
>>
>> That means the preemption points in preempt_enable() and return from
>> interrupt to kernel will not see NEED_RESCHED and the tasks can run to
>> completion either to the point where they call schedule() or when they
>> return to user space. That's pretty much what PREEMPT_NONE does today.
>>
>> The difference to NONE/VOLUNTARY is that the explicit cond_resched()
>> points are not longer required because the scheduler can preempt the
>> long running task by setting NEED_RESCHED instead.
>>
>> That preemption might be suboptimal in some cases compared to
>> cond_resched(), but from my initial experimentation that's not really an
>> issue.
>
> I am not (repeat NOT) arguing for keeping cond_resched().  I am instead
> arguing that the less-preemptible variants of the kernel should continue
> to avoid preempting RCU read-side critical sections.

[ snip ]

>> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
>> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
>> CONFIG_RT or such as it does not really change the preemption
>> model itself. RT just reduces the preemption disabled sections with the
>> lock conversions, forced interrupt threading and some more.
>
> Again, please, no.
>
> There are situations where we still need rcu_read_lock() and
> rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> repectively.  Those can be cases selected only by Kconfig option, not
> available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.

As far as non-preemptible RCU read-side critical sections are concerned,
are the current
- PREEMPT_DYNAMIC=y, PREEMPT_RCU, preempt=none config
  (rcu_read_lock/unlock() do not manipulate preempt_count, but do
   stub out preempt_schedule())
- and PREEMPT_NONE=y, TREE_RCU config (rcu_read_lock/unlock() manipulate
   preempt_count)?

roughly similar or no?

>> > I am sure that I am missing something, but I have not yet seen any
>> > show-stoppers.  Just some needed adjustments.
>>
>> Right. If it works out as I think it can work out the main adjustments
>> are to remove a large amount of #ifdef maze and related gunk :)
>
> Just please don't remove the #ifdef gunk that is still needed!

Always the hard part :).

Thanks

--
ankur
Paul E. McKenney Oct. 18, 2023, 8:42 p.m. UTC | #111
On Wed, Oct 18, 2023 at 01:15:28PM -0700, Ankur Arora wrote:
> 
> Paul E. McKenney <paulmck@kernel.org> writes:
> 
> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> >> Paul!
> >>
> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> >> > Belatedly calling out some RCU issues.  Nothing fatal, just a
> >> > (surprisingly) few adjustments that will need to be made.  The key thing
> >> > to note is that from RCU's viewpoint, with this change, all kernels
> >> > are preemptible, though rcu_read_lock() readers remain
> >> > non-preemptible.
> >>
> >> Why? Either I'm confused or you or both of us :)
> >
> > Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> > as preempt_enable() in this approach?  I certainly hope so, as RCU
> > priority boosting would be a most unwelcome addition to many datacenter
> > workloads.
> 
> No, in this approach, PREEMPT_AUTO selects PREEMPTION and thus
> PREEMPT_RCU, so rcu_read_lock/unlock() would touch
> rcu_read_lock_nesting, which is identical to what PREEMPT_DYNAMIC does.

Understood.  And we need some way to build a kernel such that RCU
read-side critical sections are non-preemptible.  This is a hard
requirement that is not going away anytime soon.

> >> With this approach the kernel is by definition fully preemptible, which
> >> means means rcu_read_lock() is preemptible too. That's pretty much the
> >> same situation as with PREEMPT_DYNAMIC.
> >
> > Please, just no!!!
> >
> > Please note that the current use of PREEMPT_DYNAMIC with preempt=none
> > avoids preempting RCU read-side critical sections.  This means that the
> > distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption
> > of RCU readers in environments expecting no preemption.
> 
> Ah. So, though PREEMPT_DYNAMIC with preempt=none runs with PREEMPT_RCU,
> preempt=none stubs out the actual preemption via __preempt_schedule.
> 
> Okay, I see what you are saying.

More to the point, currently, you can build with CONFIG_PREEMPT_DYNAMIC=n
and CONFIG_PREEMPT_NONE=y and have non-preemptible RCU read-side critical
sections.

> (Side issue: but this means that even for PREEMPT_DYNAMIC preempt=none,
> _cond_resched() doesn't call rcu_all_qs().)

I have no idea if anyone runs with CONFIG_PREEMPT_DYNAMIC=y and
preempt=none.  We don't do so.  ;-)

> >> For throughput sake this fully preemptible kernel provides a mechanism
> >> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
> >> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
> >>
> >> That means the preemption points in preempt_enable() and return from
> >> interrupt to kernel will not see NEED_RESCHED and the tasks can run to
> >> completion either to the point where they call schedule() or when they
> >> return to user space. That's pretty much what PREEMPT_NONE does today.
> >>
> >> The difference to NONE/VOLUNTARY is that the explicit cond_resched()
> >> points are not longer required because the scheduler can preempt the
> >> long running task by setting NEED_RESCHED instead.
> >>
> >> That preemption might be suboptimal in some cases compared to
> >> cond_resched(), but from my initial experimentation that's not really an
> >> issue.
> >
> > I am not (repeat NOT) arguing for keeping cond_resched().  I am instead
> > arguing that the less-preemptible variants of the kernel should continue
> > to avoid preempting RCU read-side critical sections.
> 
> [ snip ]
> 
> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> >> CONFIG_RT or such as it does not really change the preemption
> >> model itself. RT just reduces the preemption disabled sections with the
> >> lock conversions, forced interrupt threading and some more.
> >
> > Again, please, no.
> >
> > There are situations where we still need rcu_read_lock() and
> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> > repectively.  Those can be cases selected only by Kconfig option, not
> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
> 
> As far as non-preemptible RCU read-side critical sections are concerned,
> are the current
> - PREEMPT_DYNAMIC=y, PREEMPT_RCU, preempt=none config
>   (rcu_read_lock/unlock() do not manipulate preempt_count, but do
>    stub out preempt_schedule())
> - and PREEMPT_NONE=y, TREE_RCU config (rcu_read_lock/unlock() manipulate
>    preempt_count)?
> 
> roughly similar or no?

No.

There is still considerable exposure to preemptible-RCU code paths,
for example, when current->rcu_read_unlock_special.b.blocked is set.
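
(For reference, that flag is set from the scheduler's context-switch path
whenever a task is preempted inside a reader -- a heavily trimmed sketch of
rcu_note_context_switch() from kernel/rcu/tree_plugin.h, with the rcu_node
locking and the expedited/deferred-QS handling omitted:

void rcu_note_context_switch(bool preempt)
{
	struct task_struct *t = current;
	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

	if (rcu_preempt_depth() > 0 &&
	    !t->rcu_read_unlock_special.b.blocked) {
		/* Preempted mid-reader: flag it and queue the task on the rcu_node. */
		t->rcu_read_unlock_special.b.blocked = true;
		rcu_preempt_ctxt_queue(rdp->mynode, rdp);
	}
	/* ... plus the usual deferred-QS reporting on the way out ... */
}

After that, the unlock slow path, deferred quiescent-state reporting, and
priority boosting are all potentially in play.)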

> >> > I am sure that I am missing something, but I have not yet seen any
> >> > show-stoppers.  Just some needed adjustments.
> >>
> >> Right. If it works out as I think it can work out the main adjustments
> >> are to remove a large amount of #ifdef maze and related gunk :)
> >
> > Just please don't remove the #ifdef gunk that is still needed!
> 
> Always the hard part :).

Hey, we wouldn't want to insult your intelligence by letting you work
on too easy of a problem!  ;-)

						Thanx, Paul
Thomas Gleixner Oct. 18, 2023, 10:53 p.m. UTC | #112
On Wed, Oct 18 2023 at 10:51, Paul E. McKenney wrote:
> On Wed, Oct 18, 2023 at 05:09:46AM -0700, Ankur Arora wrote:

Can you folks please trim your replies. It's annoying to scroll
through hundreds of quoted lines to figure out that nothing is there.

>>  This probably allows for more configuration flexibility across archs?
>>  Would allow for TREE_RCU=y, for instance. That said, so far I've only
>>  been working with PREEMPT_RCU=y.)
>
> Then this is a bug that needs to be fixed.  We need a way to make
> RCU readers non-preemptible.

Why?
Paul E. McKenney Oct. 18, 2023, 11:25 p.m. UTC | #113
On Thu, Oct 19, 2023 at 12:53:05AM +0200, Thomas Gleixner wrote:
> On Wed, Oct 18 2023 at 10:51, Paul E. McKenney wrote:
> > On Wed, Oct 18, 2023 at 05:09:46AM -0700, Ankur Arora wrote:
> 
> Can you folks please trim your replies. It's annoying to scroll
> through hundreds of quoted lines to figure out that nothing is there.
> 
> >>  This probably allows for more configuration flexibility across archs?
> >>  Would allow for TREE_RCU=y, for instance. That said, so far I've only
> >>  been working with PREEMPT_RCU=y.)
> >
> > Then this is a bug that needs to be fixed.  We need a way to make
> > RCU readers non-preemptible.
> 
> Why?

So that we don't get tail latencies from preempted RCU readers that
result in memory-usage spikes on systems that have good and sufficient
quantities of memory, but which do not have enough memory to tolerate
readers being preempted.

							Thanx, Paul
Thomas Gleixner Oct. 19, 2023, 12:21 a.m. UTC | #114
Paul!

On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
> On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
>> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
>> > Belatedly calling out some RCU issues.  Nothing fatal, just a
>> > (surprisingly) few adjustments that will need to be made.  The key thing
>> > to note is that from RCU's viewpoint, with this change, all kernels
>> > are preemptible, though rcu_read_lock() readers remain
>> > non-preemptible.
>> 
>> Why? Either I'm confused or you or both of us :)
>
> Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> as preempt_enable() in this approach?  I certainly hope so, as RCU
> priority boosting would be a most unwelcome addition to many datacenter
> workloads.

Sure, but that's an orthogonal problem, really.

>> With this approach the kernel is by definition fully preemptible, which
>> means means rcu_read_lock() is preemptible too. That's pretty much the
>> same situation as with PREEMPT_DYNAMIC.
>
> Please, just no!!!
>
> Please note that the current use of PREEMPT_DYNAMIC with preempt=none
> avoids preempting RCU read-side critical sections.  This means that the
> distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption
> of RCU readers in environments expecting no preemption.

It does not _avoid_ it, it simply _prevents_ it by not preempting in
preempt_enable() and on return from interrupt so whatever sets
NEED_RESCHED has to wait for a voluntary invocation of schedule(),
cond_resched() or return to user space.

But under the hood RCU is fully preemptible and the boosting logic is
active, but it does not have an effect until one of those preemption
points is reached, which makes the boosting moot.

>> For throughput sake this fully preemptible kernel provides a mechanism
>> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
>> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
>> 
>> That means the preemption points in preempt_enable() and return from
>> interrupt to kernel will not see NEED_RESCHED and the tasks can run to
>> completion either to the point where they call schedule() or when they
>> return to user space. That's pretty much what PREEMPT_NONE does today.
>> 
>> The difference to NONE/VOLUNTARY is that the explicit cond_resched()
>> points are not longer required because the scheduler can preempt the
>> long running task by setting NEED_RESCHED instead.
>> 
>> That preemption might be suboptimal in some cases compared to
>> cond_resched(), but from my initial experimentation that's not really an
>> issue.
>
> I am not (repeat NOT) arguing for keeping cond_resched().  I am instead
> arguing that the less-preemptible variants of the kernel should continue
> to avoid preempting RCU read-side critical sections.

That's the whole point of the lazy mechanism:

   It avoids (repeat AVOIDS) preemption of any kernel code as much as it
   can by _not_ setting NEED_RESCHED.

   The only difference is that it does not _prevent_ it like
   preempt=none does. It will preempt when NEED_RESCHED is set.
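
For reference, resched_curr_lazy() in the PoC boils down to something like
the below -- that's my condensed summary of the hunks posted earlier in
this thread, not the verbatim patch; the real thing shares the usual rq
locking and polling details with resched_curr():

void resched_curr_lazy(struct rq *rq)
{
	lockdep_assert_rq_held(rq);

	/* Fall back to an immediate reschedule where the lazy bit is moot. */
	if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
		resched_curr(rq);
		return;
	}

	/*
	 * Only set the lazy bit. It is part of EXIT_TO_USER_MODE_WORK, so the
	 * task reschedules on its way back to user space, while preempt_enable()
	 * and the irq-return-to-kernel path look at TIF_NEED_RESCHED only and
	 * therefore leave the current kernel section alone.
	 */
	set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY);
}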

Now the question is when will NEED_RESCHED be set?

   1) If the preempting task belongs to a scheduling class above
      SCHED_OTHER

      This is a PoC implementation detail. The lazy mechanism can be
      extended to any other scheduling class w/o a big effort.

      I deliberately did not do that because:

        A) I'm lazy

        B) More importantly I wanted to demonstrate that as long as
           there are only SCHED_OTHER tasks involved there is no forced
           (via NEED_RESCHED) preemption unless the to be preempted task
           ignores the lazy resched request, which proves that
           cond_resched() can be avoided.

           At the same time such a kernel allows a RT task to preempt at
           any time.

   2) If the to be preempted task does not react within a certain time
      frame (I used a full tick in my PoC) on the NEED_RESCHED_LAZY
      request, which is the prerequisite to get rid of cond_resched()
      and related muck.

      That's obviously mandatory for getting rid of cond_resched() and
      related muck, no?

I concede that there are a lot of details to be discussed before we get
there, but I don't see a real show stopper yet.

The important point is that the details are basically boiling down to
policy decisions in the scheduler which are aided by hints from the
programmer.

As I said before we might end up with something like

   preempt_me_not_if_not_absolutely_required();
   ....
   preempt_me_I_dont_care();

(+/- name bike shedding) to give the scheduler a better understanding of
the context.

Something like that has distinct advantages over the current situation
with all the cond_resched() muck:

  1) It is clearly scope based

  2) It is properly nesting

  3) It can be easily made implicit for existing scope constructs like
     rcu_read_lock/unlock() or regular locking mechanisms, as sketched
     below.
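
     Purely illustrative, and with the obviously made up hint names from
     above, the implicit variant could look like:

	static inline void rcu_read_lock(void)
	{
		preempt_me_not_if_not_absolutely_required();	/* hint, nests */
		__rcu_read_lock();
	}

	static inline void rcu_read_unlock(void)
	{
		__rcu_read_unlock();
		preempt_me_I_dont_care();			/* drop the hint */
	}

     with the regular locking primitives getting the same treatment.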

The important point is that at the very end the scheduler has the
ultimate power to say: "No longer Mr. Nice Guy" without the risk of any
random damage due to the fact that preemption count is functional, which
makes your life easier as well as you admitted already. But that does
not mean you can eat the cake and still have it. :)

That said, I completely understand your worries about the consequences,
but please take a step back and look at it from a conceptual point of
view.

The goal is to replace the hard coded (Kconfig or DYNAMIC) policy
mechanisms with a flexible scheduler controlled policy mechanism.

That allows you to focus on one consolidated model and optimize that
for particular policy scenarios instead of dealing with optimizing the
hell out of hardcoded policies which force you to come up with
horrible workaround for each of them.

Of course the policies have to be defined (scheduling classes affected
depending on model, hint/annotation meaning etc.), but that's way more
palatable than what we have now. Let me give you a simple example:

  Right now the only way out on preempt=none when a rogue code path
  which lacks a cond_resched() fails to release the CPU is a big fat
  stall splat and a hosed machine.

  I rather prefer to have the fully controlled hammer ready which keeps
  the machine usable and the situation debuggable.

  You still can yell in dmesg, but that again is a flexible policy
  decision and not hard coded by any means.

>> > 3.	For nohz_full CPUs that run for a long time in the kernel,
>> > 	there are no scheduling-clock interrupts.  RCU reaches for
>> > 	the resched_cpu() hammer a few jiffies into the grace period.
>> > 	And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
>> > 	interrupt-entry code will re-enable its scheduling-clock interrupt
>> > 	upon receiving the resched_cpu() IPI.
>> 
>> You can spare the IPI by setting NEED_RESCHED on the remote CPU which
>> will cause it to preempt.
>
> That is not sufficient for nohz_full CPUs executing in userspace,

That's not what I was talking about. You said:

>> > 3.	For nohz_full CPUs that run for a long time in the kernel,
                                                           ^^^^^^
Duh! I did not realize that you meant user space. For user space there
is zero difference to the current situation. Once the task is out in
user space it's out of RCU read-side critical sections, so that's obviously
not a problem.

As I said: I might be confused. :)

>> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
>> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
>> CONFIG_RT or such as it does not really change the preemption
>> model itself. RT just reduces the preemption disabled sections with the
>> lock conversions, forced interrupt threading and some more.
>
> Again, please, no.
>
> There are situations where we still need rcu_read_lock() and
> rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> repectively.  Those can be cases selected only by Kconfig option, not
> available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.

Why are you so fixated on making everything hardcoded instead of making
it a proper policy decision problem? See above.

>> > 8.	As has been noted elsewhere, in this new limited-preemption
>> > 	mode of operation, rcu_read_lock() readers remain preemptible.
>> > 	This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
>> 
>> Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no?
>
> That is in fact the problem.  Preemption can be good, but it is possible
> to have too much of a good thing, and preemptible RCU read-side critical
> sections definitely is in that category for some important workloads. ;-)

See above.

>> > 10.	The cond_resched_rcu() function must remain because we still
>> > 	have non-preemptible rcu_read_lock() readers.
>> 
>> Where?
>
> In datacenters.

See above.

>> > 14.	The kernel/trace/trace_osnoise.c file's run_osnoise() function
>> > 	might need to do something for non-preemptible RCU to make
>> > 	up for the lack of cond_resched() calls.  Maybe just drop the
>> > 	"IS_ENABLED()" and execute the body of the current "if" statement
>> > 	unconditionally.
>> 
>> Again. There is no non-preemtible RCU with this model, unless I'm
>> missing something important here.
>
> And again, there needs to be non-preemptible RCU with this model.

See above.

Thanks,

        tglx
Daniel Bristot de Oliveira Oct. 19, 2023, 12:37 p.m. UTC | #115
On 10/18/23 20:13, Paul E. McKenney wrote:
> On Wed, Oct 18, 2023 at 02:00:35PM -0400, Steven Rostedt wrote:
>> On Wed, 18 Oct 2023 10:55:02 -0700
>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>>
>>>> If everything becomes PREEMPT_RCU, then the above should be able to be
>>>> turned into just:
>>>>
>>>>                 if (!disable_irq)
>>>>                         local_irq_disable();
>>>>
>>>>                 rcu_momentary_dyntick_idle();
>>>>
>>>>                 if (!disable_irq)
>>>>                         local_irq_enable();
>>>>
>>>> And no cond_resched() is needed.  
>>>
>>> Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
>>> run_osnoise() is running in kthread context with preemption and everything
>>> else enabled (am I right?), then the change you suggest should work fine.
>>
>> There's a user space option that lets you run that loop with preemption and/or
>> interrupts disabled.
> 
> Ah, thank you.  Then as long as this function is not expecting an RCU
> reader to span that call to rcu_momentary_dyntick_idle(), all is well.
> This is a kthread, so there cannot be something else expecting an RCU
> reader to span that call.

Sorry for the delay, this thread is quite long (and I admit I should be paying
attention to it).

It seems that you both figured it out without me anyway. This piece of
code is preemptible unless a config option is set to disable irqs or preemption (as
Steven mentioned). That call is just a ping to RCU to say that things
are fine.

So Steven's suggestion should work.

>>>>> Again. There is no non-preemtible RCU with this model, unless I'm
>>>>> missing something important here.  
>>>>
>>>> Daniel?  
>>>
>>> But very happy to defer to Daniel.  ;-)
>>
>> But Daniel could also correct me ;-)
> 
> If he figures out a way that it is broken, he gets to fix it.  ;-)

It works for me; keep me in the loop for the patches and I can test and
adjust osnoise accordingly. osnoise should not be a reason to block more
important things like this patch set, and we can find a way out on
the osnoise tracer side. (I might need some assistance from the RCU
people, but I know I can count on them :-).

Thanks!
-- Daniel
> 						Thanx, Paul
Paul E. McKenney Oct. 19, 2023, 5:08 p.m. UTC | #116
On Thu, Oct 19, 2023 at 02:37:23PM +0200, Daniel Bristot de Oliveira wrote:
> On 10/18/23 20:13, Paul E. McKenney wrote:
> > On Wed, Oct 18, 2023 at 02:00:35PM -0400, Steven Rostedt wrote:
> >> On Wed, 18 Oct 2023 10:55:02 -0700
> >> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >>
> >>>> If everything becomes PREEMPT_RCU, then the above should be able to be
> >>>> turned into just:
> >>>>
> >>>>                 if (!disable_irq)
> >>>>                         local_irq_disable();
> >>>>
> >>>>                 rcu_momentary_dyntick_idle();
> >>>>
> >>>>                 if (!disable_irq)
> >>>>                         local_irq_enable();
> >>>>
> >>>> And no cond_resched() is needed.  
> >>>
> >>> Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
> >>> run_osnoise() is running in kthread context with preemption and everything
> >>> else enabled (am I right?), then the change you suggest should work fine.
> >>
> >> There's a user space option that lets you run that loop with preemption and/or
> >> interrupts disabled.
> > 
> > Ah, thank you.  Then as long as this function is not expecting an RCU
> > reader to span that call to rcu_momentary_dyntick_idle(), all is well.
> > This is a kthread, so there cannot be something else expecting an RCU
> > reader to span that call.
> 
> Sorry for the delay, this thread is quite long (and I admit I should be paying
> attention to it).
> 
> It seems that you both figured it out without me anyway. This piece of
> code is preemptible unless a config option is set to disable irqs or preemption (as
> Steven mentioned). That call is just a ping to RCU to say that things
> are fine.
> 
> So Steven's suggestion should work.

Very good!

> >>>>> Again. There is no non-preemtible RCU with this model, unless I'm
> >>>>> missing something important here.  
> >>>>
> >>>> Daniel?  
> >>>
> >>> But very happy to defer to Daniel.  ;-)
> >>
> >> But Daniel could also correct me ;-)
> > 
> > If he figures out a way that it is broken, he gets to fix it.  ;-)
> 
> It works for me; keep me in the loop for the patches and I can test and
> adjust osnoise accordingly. osnoise should not be a reason to block more
> important things like this patch set, and we can find a way out on
> the osnoise tracer side. (I might need some assistance from the RCU
> people, but I know I can count on them :-).

For good or for bad, we will be here.  ;-)

							Thanx, Paul
Paul E. McKenney Oct. 19, 2023, 7:13 p.m. UTC | #117
Thomas!

On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
> Paul!
> 
> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> >> > Belatedly calling out some RCU issues.  Nothing fatal, just a
> >> > (surprisingly) few adjustments that will need to be made.  The key thing
> >> > to note is that from RCU's viewpoint, with this change, all kernels
> >> > are preemptible, though rcu_read_lock() readers remain
> >> > non-preemptible.
> >> 
> >> Why? Either I'm confused or you or both of us :)
> >
> > Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> > as preempt_enable() in this approach?  I certainly hope so, as RCU
> > priority boosting would be a most unwelcome addition to many datacenter
> > workloads.
> 
> Sure, but that's an orthogonal problem, really.

Orthogonal, parallel, skew, whatever, it and its friends still need to
be addressed.

> >> With this approach the kernel is by definition fully preemptible, which
> >> means means rcu_read_lock() is preemptible too. That's pretty much the
> >> same situation as with PREEMPT_DYNAMIC.
> >
> > Please, just no!!!
> >
> > Please note that the current use of PREEMPT_DYNAMIC with preempt=none
> > avoids preempting RCU read-side critical sections.  This means that the
> > distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption
> > of RCU readers in environments expecting no preemption.
> 
> It does not _avoid_ it, it simply _prevents_ it by not preempting in
> preempt_enable() and on return from interrupt so whatever sets
> NEED_RESCHED has to wait for a voluntary invocation of schedule(),
> cond_resched() or return to user space.

A distinction without a difference.  ;-)

> But under the hood RCU is fully preemptible and the boosting logic is
> active, but it does not have an effect until one of those preemption
> points is reached, which makes the boosting moot.

And for many distros, this appears to be just fine, not that I personally
know of anyone running large numbers of systems in production with
kernels built with CONFIG_PREEMPT_DYNAMIC=y and booted with preempt=none.
And let's face it, if you want exactly the same binary to support both
modes, you are stuck with the fully-preemptible implementation of RCU.
But we should not make a virtue of such a distro's necessity.

And some of us are not afraid to build our own kernels, which allows
us to completely avoid the added code required to make RCU read-side
critical sections be preemptible.

> >> For throughput sake this fully preemptible kernel provides a mechanism
> >> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
> >> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
> >> 
> >> That means the preemption points in preempt_enable() and return from
> >> interrupt to kernel will not see NEED_RESCHED and the tasks can run to
> >> completion either to the point where they call schedule() or when they
> >> return to user space. That's pretty much what PREEMPT_NONE does today.
> >> 
> >> The difference to NONE/VOLUNTARY is that the explicit cond_resched()
> >> points are not longer required because the scheduler can preempt the
> >> long running task by setting NEED_RESCHED instead.
> >> 
> >> That preemption might be suboptimal in some cases compared to
> >> cond_resched(), but from my initial experimentation that's not really an
> >> issue.
> >
> > I am not (repeat NOT) arguing for keeping cond_resched().  I am instead
> > arguing that the less-preemptible variants of the kernel should continue
> > to avoid preempting RCU read-side critical sections.
> 
> That's the whole point of the lazy mechanism:
> 
>    It avoids (repeat AVOIDS) preemption of any kernel code as much as it
>    can by _not_ setting NEED_RESCHED.
> 
>    The only difference is that it does not _prevent_ it like
>    preempt=none does. It will preempt when NEED_RESCHED is set.
> 
> Now the question is when will NEED_RESCHED be set?
> 
>    1) If the preempting task belongs to a scheduling class above
>       SCHED_OTHER
> 
>       This is a PoC implementation detail. The lazy mechanism can be
>       extended to any other scheduling class w/o a big effort.
> 
>       I deliberately did not do that because:
> 
>         A) I'm lazy
> 
>         B) More importantly I wanted to demonstrate that as long as
>            there are only SCHED_OTHER tasks involved there is no forced
>            (via NEED_RESCHED) preemption unless the to be preempted task
>            ignores the lazy resched request, which proves that
>            cond_resched() can be avoided.
> 
>            At the same time such a kernel allows a RT task to preempt at
>            any time.
> 
>    2) If the to be preempted task does not react within a certain time
>       frame (I used a full tick in my PoC) on the NEED_RESCHED_LAZY
>       request, which is the prerequisite to get rid of cond_resched()
>       and related muck.
> 
>       That's obviously mandatory for getting rid of cond_resched() and
>       related muck, no?

Keeping firmly in mind that there are no cond_resched() calls within RCU
read-side critical sections, sure.  Or, if you prefer, any such calls
are bugs.  And agreed, outside of atomic contexts (in my specific case,
including RCU readers), there does eventually need to be a preemption.
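
(The one blessed pattern for long-running reader loops is cond_resched_rcu(),
which drops out of the reader before rescheduling -- this is the existing
helper from include/linux/sched.h, quoted from memory:

static inline void cond_resched_rcu(void)
{
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
	rcu_read_unlock();
	cond_resched();
	rcu_read_lock();
#endif
}

so the reader is never held across the voluntary schedule.)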

> I concede that there are a lot of details to be discussed before we get
> there, but I don't see a real show stopper yet.

Which is what I have been saying as well, at least as long as we can
have a way of building a kernel with a non-preemptible build of RCU.
And not just a preemptible RCU in which the scheduler (sometimes?)
refrains from preempting the RCU read-side critical sections, but
really only having the CONFIG_PREEMPT_RCU=n code built.

Give or take the needs of the KLP guys, but again, I must defer to
them.

> The important point is that the details are basically boiling down to
> policy decisions in the scheduler which are aided by hints from the
> programmer.
> 
> As I said before we might end up with something like
> 
>    preempt_me_not_if_not_absolutely_required();
>    ....
>    preempt_me_I_dont_care();
> 
> (+/- name bike shedding) to give the scheduler a better understanding of
> the context.
> 
> Something like that has distinct advantages over the current situation
> with all the cond_resched() muck:
> 
>   1) It is clearly scope based
> 
>   2) It is properly nesting
> 
>   3) It can be easily made implicit for existing scope constructs like
>      rcu_read_lock/unlock() or regular locking mechanisms.

You know, I was on board with throwing cond_resched() overboard (again,
give or take whatever KLP might need) when I first read of this in that
LWN article.  You therefore cannot possibly gain anything by continuing
to sell it to me, and, worse yet, you might provoke an heretofore-innocent
bystander into pushing some bogus but convincing argument against.  ;-)

Yes, there are risks due to additional state space exposed by the
additional preemption.  However, at least some of this is already covered
by quite a few people running preemptible kernels.  There will be some
not covered, given our sensitivity to low-probability bugs, but there
should also be some improvements in tail latency.  The process of getting
the first cond_resched()-free kernel deployed will therefore likely be
a bit painful, but overall the gains should be worth the pain.

> The important point is that at the very end the scheduler has the
> ultimate power to say: "Not longer Mr. Nice Guy" without the risk of any
> random damage due to the fact that preemption count is functional, which
> makes your life easier as well as you admitted already. But that does
> not mean you can eat the cake and still have it. :)

Which is exactly why I need rcu_read_lock() to map to preempt_disable()
and rcu_read_unlock() to preempt_enable().  ;-)
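
For reference, that mapping looks roughly like this in a CONFIG_PREEMPT_RCU=n
build (a simplified sketch of the !PREEMPT_RCU definitions in
include/linux/rcupdate.h; lockdep and debug hooks are omitted):

 static inline void __rcu_read_lock(void)
 {
	preempt_disable();	/* readers cannot be preempted ... */
 }

 static inline void __rcu_read_unlock(void)
 {
	preempt_enable();	/* ... and unlock may be a preemption point */
 }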

> That said, I completely understand your worries about the consequences,
> but please take the step back and look at it from a conceptual point of
> view.

Conceptual point of view?  That sounds suspiciously academic.  Who are
you and what did you do with the real Thomas Gleixner?  ;-)

But yes, consequences are extremely important, as always.

> The goal is to replace the hard coded (Kconfig or DYNAMIC) policy
> mechanisms with a flexible scheduler controlled policy mechanism.

Are you saying that CONFIG_PREEMPT_RT will also be selected at boot time
and/or via debugfs?

> That allows you to focus on one consolidated model and optimize that
> for particular policy scenarios instead of dealing with optimizing the
> hell out of hardcoded policies which force you to come up with
> horrible workaround for each of them.
> 
> Of course the policies have to be defined (scheduling classes affected
> depending on model, hint/annotation meaning etc.), but that's way more
> palatable than what we have now. Let me give you a simple example:
> 
>   Right now the only way out on preempt=none when a rogue code path
>   which lacks a cond_resched() fails to release the CPU is a big fat
>   stall splat and a hosed machine.
> 
>   I rather prefer to have the fully controlled hammer ready which keeps
>   the machine usable and the situation debuggable.
> 
>   You still can yell in dmesg, but that again is a flexible policy
>   decision and not hard coded by any means.

And I have agreed from my first read of that LWN article that allowing
preemption of code where preempt_count()=0 is a good thing.

The only thing that I am pushing back on is specifically your wish to
always be running the CONFIG_PREEMPT_RCU=y RCU code.  Yes, that is what
single-binary distros will do, just as they do now.  But again, some of
us are happy to build our own kernels.

There might be other things that I should be pushing back on, but that
is all that I am aware of right now.  ;-)

> >> > 3.	For nohz_full CPUs that run for a long time in the kernel,
> >> > 	there are no scheduling-clock interrupts.  RCU reaches for
> >> > 	the resched_cpu() hammer a few jiffies into the grace period.
> >> > 	And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> >> > 	interrupt-entry code will re-enable its scheduling-clock interrupt
> >> > 	upon receiving the resched_cpu() IPI.
> >> 
> >> You can spare the IPI by setting NEED_RESCHED on the remote CPU which
> >> will cause it to preempt.
> >
> > That is not sufficient for nohz_full CPUs executing in userspace,
> 
> That's not what I was talking about. You said:
> 
> >> > 3.	For nohz_full CPUs that run for a long time in the kernel,
>                                                            ^^^^^^
> Duh! I did not realize that you meant user space. For user space there
> is zero difference to the current situation. Once the task is out in
> user space it's out of RCU side critical sections, so that's obiously
> not a problem.
> 
> As I said: I might be confused. :)

And I might well also be confused.  Here is my view for nohz_full CPUs:

o	Running in userspace.  RCU will ignore them without disturbing
	the CPU, courtesy of context tracking.  As you say, there is
	no way (absent extremely strange sidechannel attacks) to
	have a kernel RCU read-side critical section here.

	These CPUs will ignore NEED_RESCHED until they exit usermode
	one way or another.  This exit will usually be supplied by
	the scheduler's wakeup IPI for the newly awakened task.

	But just setting NEED_RESCHED without otherwise getting the
	CPU's full attention won't have any effect.

o	Running in the kernel entry/exit code.	RCU will ignore them
	without disturbing the CPU, courtesy of context tracking.
	Unlike usermode, you can type rcu_read_lock(), but if you do,
	lockdep will complain bitterly.

	Assuming the time in the kernel is sharply bounded, as it
	usually will be, these CPUs will respond to NEED_RESCHED in a
	timely manner.	For longer times in the kernel, please see below.

o	Running in the kernel in deep idle, that is, where RCU is not
	watching.  RCU will ignore them without disturbing the CPU,
	courtesy of context tracking.  As with the entry/exit code,
	you can type rcu_read_lock(), but if you do, lockdep will
	complain bitterly.

	The exact response to NEED_RESCHED depends on the type of idle
	loop, with (as I understand it) polling idle loops responding
	quickly and other idle loops needing some event to wake up
	the CPU.  This event is typically an IPI, as is the case when
	the scheduler wakes up a task on the CPU in question.

o	Running in other parts of the kernel, but with scheduling
	clock interrupt enabled.  The next scheduling clock interrupt
	will take care of both RCU and NEED_RESCHED.  Give or take
	policy decisions, as you say above.

o	Running in other parts of the kernel, but with scheduling clock
	interrupt disabled.  If there is a grace period waiting on this
	CPU, RCU will eventually set a flag and invoke resched_cpu(),
	which will get the CPU's attention via an IPI and will also turn
	the scheduling clock interrupt back on.

	I believe that a wakeup from the scheduler has the same effect,
	and that it uses an IPI to get the CPU's attention when needed,
	but it has been one good long time since I traced out all the
	details.

	However, given that there is to be no cond_resched(), setting
	NEED_RESCHED without doing something like an IPI to get that
	CPU's attention will still not be guaranteed to have any effect,
	just as with the nohz_full CPU executing in userspace, correct?

Did I miss anything?
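
To make the last two bullets a bit more concrete, the nudge described above
looks roughly like the following (loosely modeled on
rcu_implicit_dynticks_qs() in kernel/rcu/tree.c; the field names and the
exact timeout condition here are approximations rather than the precise
upstream code):

 /* A grace period is waiting and this CPU has not yet reported in. */
 if (tick_nohz_full_cpu(rdp->cpu) &&
     time_after(jiffies, READ_ONCE(rcu_state.gp_start) + HZ)) {
	/* Ask interrupt entry to turn the scheduling-clock tick back on... */
	WRITE_ONCE(rdp->rcu_urgent_qs, true);
	/* ...and send the resched_cpu() IPI to get the CPU's attention. */
	resched_cpu(rdp->cpu);
 }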

> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> >> CONFIG_RT or such as it does not really change the preemption
> >> model itself. RT just reduces the preemption disabled sections with the
> >> lock conversions, forced interrupt threading and some more.
> >
> > Again, please, no.
> >
> > There are situations where we still need rcu_read_lock() and
> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> > repectively.  Those can be cases selected only by Kconfig option, not
> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
> 
> Why are you so fixated on making everything hardcoded instead of making
> it a proper policy decision problem. See above.

Because I am one of the people who will bear the consequences.

In that same vein, why are you so opposed to continuing to provide
the ability to build a kernel with CONFIG_PREEMPT_RCU=n?  This code
is already in place, is extremely well tested, and you need to handle
preempt_disable()/preempt_enable() regions of code in any case.  What is
the real problem here?

> >> > 8.	As has been noted elsewhere, in this new limited-preemption
> >> > 	mode of operation, rcu_read_lock() readers remain preemptible.
> >> > 	This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
> >> 
> >> Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no?
> >
> > That is in fact the problem.  Preemption can be good, but it is possible
> > to have too much of a good thing, and preemptible RCU read-side critical
> > sections definitely is in that category for some important workloads. ;-)
> 
> See above.
> 
> >> > 10.	The cond_resched_rcu() function must remain because we still
> >> > 	have non-preemptible rcu_read_lock() readers.
> >> 
> >> Where?
> >
> > In datacenters.
> 
> See above.
> 
> >> > 14.	The kernel/trace/trace_osnoise.c file's run_osnoise() function
> >> > 	might need to do something for non-preemptible RCU to make
> >> > 	up for the lack of cond_resched() calls.  Maybe just drop the
> >> > 	"IS_ENABLED()" and execute the body of the current "if" statement
> >> > 	unconditionally.
> >> 
> >> Again. There is no non-preemtible RCU with this model, unless I'm
> >> missing something important here.
> >
> > And again, there needs to be non-preemptible RCU with this model.
> 
> See above.

And back at you with all three instances of "See above".  ;-)

							Thanx, Paul
Paul E. McKenney Oct. 20, 2023, 9:59 p.m. UTC | #118
On Thu, Oct 19, 2023 at 12:13:31PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
> > On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
> > > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> > >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:

[ . . . ]

> > >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> > >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> > >> CONFIG_RT or such as it does not really change the preemption
> > >> model itself. RT just reduces the preemption disabled sections with the
> > >> lock conversions, forced interrupt threading and some more.
> > >
> > > Again, please, no.
> > >
> > > There are situations where we still need rcu_read_lock() and
> > > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> > > repectively.  Those can be cases selected only by Kconfig option, not
> > > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
> > 
> > Why are you so fixated on making everything hardcoded instead of making
> > it a proper policy decision problem. See above.
> 
> Because I am one of the people who will bear the consequences.
> 
> In that same vein, why are you so opposed to continuing to provide
> the ability to build a kernel with CONFIG_PREEMPT_RCU=n?  This code
> is already in place, is extremely well tested, and you need to handle
> preempt_disable()/preeempt_enable() regions of code in any case.  What is
> the real problem here?

I should hasten to add that from a conceptual viewpoint, I do support
the eventual elimination of CONFIG_PREEMPT_RCU=n code, but with emphasis
on the word "eventual".  Although preemptible RCU is plenty reliable if
you are running only a few thousand servers (and maybe even a few tens
of thousands), it has some improving to do before I will be comfortable
recommending its use in large-scale datacenters.

And yes, I know about Android deployments.  But those devices tend
to spend very little time in the kernel, in fact, many of them tend to
spend very little time powered up.  Plus they tend to have relatively few
CPUs, at least by 2020s standards.  So it takes a rather large number of
Android devices to impose the same stress on the kernel that is imposed
by a single mid-sized server.

And we are working on making preemptible RCU more reliable.  One nice
change over the past 5-10 years is that more people are getting serious
about digging into the RCU code, testing it, and reporting and fixing the
resulting bugs.  I am also continuing to make rcutorture more vicious,
and of course I am greatly helped by the easier availability of hardware
with which to test RCU.

If this level of activity continues for another five years, then maybe
preemptible RCU will be ready for large datacenter deployments.

But I am guessing that you had something in mind in addition to code
consolidation.

							Thanx, Paul
Ankur Arora Oct. 20, 2023, 10:56 p.m. UTC | #119
Paul E. McKenney <paulmck@kernel.org> writes:

> Thomas!
>
> On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
>> Paul!
>>
>> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
>> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
>> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
>> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
>> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
>> >> CONFIG_RT or such as it does not really change the preemption
>> >> model itself. RT just reduces the preemption disabled sections with the
>> >> lock conversions, forced interrupt threading and some more.
>> >
>> > Again, please, no.
>> >
>> > There are situations where we still need rcu_read_lock() and
>> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
>> > repectively.  Those can be cases selected only by Kconfig option, not
>> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
>>
>> Why are you so fixated on making everything hardcoded instead of making
>> it a proper policy decision problem. See above.
>
> Because I am one of the people who will bear the consequences.
>
> In that same vein, why are you so opposed to continuing to provide
> the ability to build a kernel with CONFIG_PREEMPT_RCU=n?  This code
> is already in place, is extremely well tested, and you need to handle
> preempt_disable()/preeempt_enable() regions of code in any case.  What is
> the real problem here?

I have a somewhat related question. What ties PREEMPTION=y to PREEMPT_RCU=y?

I see e72aeafc66 ("rcu: Remove prompt for RCU implementation") from
2015, stating that the only possible choice for PREEMPTION=y kernels
is PREEMPT_RCU=y:

    The RCU implementation is chosen based on PREEMPT and SMP config options
    and is not really a user-selectable choice.  This commit removes the
    menu entry, given that there is not much point in calling something a
    choice when there is in fact no choice..  The TINY_RCU, TREE_RCU, and
    PREEMPT_RCU Kconfig options continue to be selected based solely on the
    values of the PREEMPT and SMP options.

As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly
stronger forward progress guarantees with respect to rcu readers (in
that they can't be preempted.)

So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something
obvious there.


Thanks

--
ankur
Paul E. McKenney Oct. 20, 2023, 11:36 p.m. UTC | #120
On Fri, Oct 20, 2023 at 03:56:38PM -0700, Ankur Arora wrote:
> 
> Paul E. McKenney <paulmck@kernel.org> writes:
> 
> > Thomas!
> >
> > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
> >> Paul!
> >>
> >> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
> >> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> >> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> >> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> >> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> >> >> CONFIG_RT or such as it does not really change the preemption
> >> >> model itself. RT just reduces the preemption disabled sections with the
> >> >> lock conversions, forced interrupt threading and some more.
> >> >
> >> > Again, please, no.
> >> >
> >> > There are situations where we still need rcu_read_lock() and
> >> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> >> > repectively.  Those can be cases selected only by Kconfig option, not
> >> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
> >>
> >> Why are you so fixated on making everything hardcoded instead of making
> >> it a proper policy decision problem. See above.
> >
> > Because I am one of the people who will bear the consequences.
> >
> > In that same vein, why are you so opposed to continuing to provide
> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n?  This code
> > is already in place, is extremely well tested, and you need to handle
> > preempt_disable()/preeempt_enable() regions of code in any case.  What is
> > the real problem here?
> 
> I have a somewhat related question. What ties PREEMPTION=y to PREEMPT_RCU=y?

This Kconfig block in kernel/rcu/Kconfig:

------------------------------------------------------------------------

config PREEMPT_RCU
	bool
	default y if PREEMPTION
	select TREE_RCU
	help
	  This option selects the RCU implementation that is
	  designed for very large SMP systems with hundreds or
	  thousands of CPUs, but for which real-time response
	  is also required.  It also scales down nicely to
	  smaller systems.

	  Select this option if you are unsure.

------------------------------------------------------------------------

There is no prompt string after the "bool", so it is not user-settable.
Therefore, it is driven directly off of the value of PREEMPTION, taking
the global default of "n" if PREEMPTION is not set and "y" otherwise.

You could change the second line to read:

	bool "Go ahead!  Make my day!"

or preferably something more helpful.  This change would allow a
preemptible kernel to be built with non-preemptible RCU and vice versa,
as used to be the case long ago.  However, it might be way better to drive
the choice from some other Kconfig option and leave out the prompt string.

> I see e72aeafc66 ("rcu: Remove prompt for RCU implementation") from
> 2015, stating that the only possible choice for PREEMPTION=y kernels
> is PREEMPT_RCU=y:
> 
>     The RCU implementation is chosen based on PREEMPT and SMP config options
>     and is not really a user-selectable choice.  This commit removes the
>     menu entry, given that there is not much point in calling something a
>     choice when there is in fact no choice..  The TINY_RCU, TREE_RCU, and
>     PREEMPT_RCU Kconfig options continue to be selected based solely on the
>     values of the PREEMPT and SMP options.

The main point of this commit was to reduce testing effort and sysadm
confusion by removing choices that were not necessary back then.

> As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly
> stronger forward progress guarantees with respect to rcu readers (in
> that they can't be preempted.)

TREE_RCU=y is absolutely required if you want a kernel to run on a system
with more than one CPU, and for that matter, if you want preemptible RCU,
even on a single-CPU system.

> So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something
> obvious there.

If you meant to ask about PREEMPTION and PREEMPT_RCU, in theory, you
can run any combination:

PREEMPTION && PREEMPT_RCU:  This is what we use today for preemptible
	kernels, so this works just fine (famous last words).

PREEMPTION && !PREEMPT_RCU:  A preemptible kernel with non-preemptible
	RCU, so that rcu_read_lock() is preempt_disable() and
	rcu_read_unlock() is preempt_enable().	This should just work,
	except for the fact that cond_resched() disappears, which
	stymies some of RCU's forward-progress mechanisms.  And this
	was the topic of our earlier discussion on this thread.  The
	fixes should not be too hard.

	Of course, this has not been either tested or used for at least
	eight years, so there might be some bitrot.  If so, I will of
	course be happy to help fix it.

!PREEMPTION && PREEMPT_RCU:  A non-preemptible kernel with preemptible
	RCU.  Although this particular combination of Kconfig
	options has not been tested for at least eight years, giving
	a kernel built with CONFIG_PREEMPT_DYNAMIC=y the preempt=none
	kernel boot parameter gets you pretty close.  Again, there is
	likely to be some bitrot somewhere, but way fewer bits to rot
	than for PREEMPTION && !PREEMPT_RCU.  Outside of the current
	CONFIG_PREEMPT_DYNAMIC=y case, I don't see the need for this
	combination, but if there is a need and if it is broken, I will
	be happy to help fix it.

!PREEMPTION && !PREEMPT_RCU:  A non-preemptible kernel with non-preemptible
	RCU, which is what we use today for non-preemptible kernels built
	with CONFIG_PREEMPT_DYNAMIC=n.	So to repeat those famous last
	words, this works just fine.

Does that help, or am I missing the point of your question?

							Thanx, Paul
Ankur Arora Oct. 21, 2023, 1:05 a.m. UTC | #121
Paul E. McKenney <paulmck@kernel.org> writes:

> On Fri, Oct 20, 2023 at 03:56:38PM -0700, Ankur Arora wrote:
>>
>> Paul E. McKenney <paulmck@kernel.org> writes:
>>
>> > Thomas!
>> >
>> > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
>> >> Paul!
>> >>
>> >> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
>> >> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
>> >> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
>> >> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
>> >> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
>> >> >> CONFIG_RT or such as it does not really change the preemption
>> >> >> model itself. RT just reduces the preemption disabled sections with the
>> >> >> lock conversions, forced interrupt threading and some more.
>> >> >
>> >> > Again, please, no.
>> >> >
>> >> > There are situations where we still need rcu_read_lock() and
>> >> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
>> >> > repectively.  Those can be cases selected only by Kconfig option, not
>> >> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
>> >>
>> >> Why are you so fixated on making everything hardcoded instead of making
>> >> it a proper policy decision problem. See above.
>> >
>> > Because I am one of the people who will bear the consequences.
>> >
>> > In that same vein, why are you so opposed to continuing to provide
>> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n?  This code
>> > is already in place, is extremely well tested, and you need to handle
>> > preempt_disable()/preeempt_enable() regions of code in any case.  What is
>> > the real problem here?
>>

[ snip ]

>> As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly
>> stronger forward progress guarantees with respect to rcu readers (in
>> that they can't be preempted.)
>
> TREE_RCU=y is absolutely required if you want a kernel to run on a system
> with more than one CPU, and for that matter, if you want preemptible RCU,
> even on a single-CPU system.
>
>> So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something
>> obvious there.
>
> If you meant to ask about PREEMPTION and PREEMPT_RCU, in theory, you
> can run any combination:

Sorry, yes I did. Should have said "can PREEMPTION=y run with (TREE_RCU=y,
PREEMPT_RCU=n)?"

> PREEMPTION && PREEMPT_RCU:  This is what we use today for preemptible
> 	kernels, so this works just fine (famous last words).
>
> PREEMPTION && !PREEMPT_RCU:  A preemptible kernel with non-preemptible
> 	RCU, so that rcu_read_lock() is preempt_disable() and
> 	rcu_read_unlock() is preempt_enable().	This should just work,
> 	except for the fact that cond_resched() disappears, which
> 	stymies some of RCU's forward-progress mechanisms.  And this
> 	was the topic of our earlier discussion on this thread.  The
> 	fixes should not be too hard.
>
> 	Of course, this has not been either tested or used for at least
> 	eight years, so there might be some bitrot.  If so, I will of
> 	course be happy to help fix it.
>
>
> !PREEMPTION && PREEMPT_RCU:  A non-preemptible kernel with preemptible
> 	RCU.  Although this particular combination of Kconfig
> 	options has not been tested for at least eight years, giving
> 	a kernel built with CONFIG_PREEMPT_DYNAMIC=y the preempt=none
> 	kernel boot parameter gets you pretty close.  Again, there is
> 	likely to be some bitrot somewhere, but way fewer bits to rot
> 	than for PREEMPTION && !PREEMPT_RCU.  Outside of the current
> 	CONFIG_PREEMPT_DYNAMIC=y case, I don't see the need for this
> 	combination, but if there is a need and if it is broken, I will
> 	be happy to help fix it.
>
> !PREEMPTION && !PREEMPT_RCU:  A non-preemptible kernel with non-preemptible
> 	RCU, which is what we use today for non-preemptible kernels built
> 	with CONFIG_PREEMPT_DYNAMIC=n.	So to repeat those famous last
> 	works, this works just fine.
>
> Does that help, or am I missing the point of your question?

It does indeed. What I was going for is that this series (or, at
least my adaptation of TGLX's PoC) wants to keep CONFIG_PREEMPTION
in spirit, while doing away with it as a compile-time config option.

That it does, as TGLX mentioned upthread, by moving all of the policy
to the scheduler, which can be tuned by user-space (via sched-features.)

So, my question was in response to this:

>> > In that same vein, why are you so opposed to continuing to provide
>> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n?  This code
>> > is already in place, is extremely well tested, and you need to handle
>> > preempt_disable()/preeempt_enable() regions of code in any case.  What is
>> > the real problem here?

Based on your response, the (PREEMPT_RCU=n, TREE_RCU=y) configuration
seems to be eminently usable with this series.

(Or maybe I missed the point of that discussion.)

On a related note, I had started rcutorture on a (PREEMPTION=y, PREEMPT_RCU=n,
TREE_RCU=y) kernel some hours ago. Nothing broken (yet!).

--
ankur
Paul E. McKenney Oct. 21, 2023, 2:08 a.m. UTC | #122
On Fri, Oct 20, 2023 at 06:05:21PM -0700, Ankur Arora wrote:
> 
> Paul E. McKenney <paulmck@kernel.org> writes:
> 
> > On Fri, Oct 20, 2023 at 03:56:38PM -0700, Ankur Arora wrote:
> >>
> >> Paul E. McKenney <paulmck@kernel.org> writes:
> >>
> >> > Thomas!
> >> >
> >> > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
> >> >> Paul!
> >> >>
> >> >> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote:
> >> >> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote:
> >> >> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> >> >> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> >> >> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> >> >> >> CONFIG_RT or such as it does not really change the preemption
> >> >> >> model itself. RT just reduces the preemption disabled sections with the
> >> >> >> lock conversions, forced interrupt threading and some more.
> >> >> >
> >> >> > Again, please, no.
> >> >> >
> >> >> > There are situations where we still need rcu_read_lock() and
> >> >> > rcu_read_unlock() to be preempt_disable() and preempt_enable(),
> >> >> > repectively.  Those can be cases selected only by Kconfig option, not
> >> >> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.
> >> >>
> >> >> Why are you so fixated on making everything hardcoded instead of making
> >> >> it a proper policy decision problem. See above.
> >> >
> >> > Because I am one of the people who will bear the consequences.
> >> >
> >> > In that same vein, why are you so opposed to continuing to provide
> >> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n?  This code
> >> > is already in place, is extremely well tested, and you need to handle
> >> > preempt_disable()/preeempt_enable() regions of code in any case.  What is
> >> > the real problem here?
> >>
> 
> [ snip ]
> 
> >> As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly
> >> stronger forward progress guarantees with respect to rcu readers (in
> >> that they can't be preempted.)
> >
> > TREE_RCU=y is absolutely required if you want a kernel to run on a system
> > with more than one CPU, and for that matter, if you want preemptible RCU,
> > even on a single-CPU system.
> >
> >> So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something
> >> obvious there.
> >
> > If you meant to ask about PREEMPTION and PREEMPT_RCU, in theory, you
> > can run any combination:
> 
> Sorry, yes I did. Should have said "can PREEMPTION=y run with, (TREE_RCU=y,
> PREEMPT_RCU=n).
> 
> > PREEMPTION && PREEMPT_RCU:  This is what we use today for preemptible
> > 	kernels, so this works just fine (famous last words).
> >
> > PREEMPTION && !PREEMPT_RCU:  A preemptible kernel with non-preemptible
> > 	RCU, so that rcu_read_lock() is preempt_disable() and
> > 	rcu_read_unlock() is preempt_enable().	This should just work,
> > 	except for the fact that cond_resched() disappears, which
> > 	stymies some of RCU's forward-progress mechanisms.  And this
> > 	was the topic of our earlier discussion on this thread.  The
> > 	fixes should not be too hard.
> >
> > 	Of course, this has not been either tested or used for at least
> > 	eight years, so there might be some bitrot.  If so, I will of
> > 	course be happy to help fix it.
> >
> >
> > !PREEMPTION && PREEMPT_RCU:  A non-preemptible kernel with preemptible
> > 	RCU.  Although this particular combination of Kconfig
> > 	options has not been tested for at least eight years, giving
> > 	a kernel built with CONFIG_PREEMPT_DYNAMIC=y the preempt=none
> > 	kernel boot parameter gets you pretty close.  Again, there is
> > 	likely to be some bitrot somewhere, but way fewer bits to rot
> > 	than for PREEMPTION && !PREEMPT_RCU.  Outside of the current
> > 	CONFIG_PREEMPT_DYNAMIC=y case, I don't see the need for this
> > 	combination, but if there is a need and if it is broken, I will
> > 	be happy to help fix it.
> >
> > !PREEMPTION && !PREEMPT_RCU:  A non-preemptible kernel with non-preemptible
> > 	RCU, which is what we use today for non-preemptible kernels built
> > 	with CONFIG_PREEMPT_DYNAMIC=n.	So to repeat those famous last
> > 	works, this works just fine.
> >
> > Does that help, or am I missing the point of your question?
> 
> It does indeed. What I was going for, is that this series (or, at
> least my adaptation of TGLX's PoC) wants to keep CONFIG_PREEMPTION
> in spirit, while doing away with it as a compile-time config option.
> 
> That it does, as TGLX mentioned upthread, by moving all of the policy
> to the scheduler, which can be tuned by user-space (via sched-features.)
> 
> So, my question was in response to this:
> 
> >> > In that same vein, why are you so opposed to continuing to provide
> >> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n?  This code
> >> > is already in place, is extremely well tested, and you need to handle
> >> > preempt_disable()/preeempt_enable() regions of code in any case.  What is
> >> > the real problem here?
> 
> Based on your response the (PREEMPT_RCU=n, TREE_RCU=y) configuration
> seems to be eminently usable with this configuration.
> 
> (Or maybe I'm missed the point of that discussion.)
> 
> On a related note, I had started rcutorture on a (PREEMPTION=y, PREEMPT_RCU=n,
> TREE_RCU=y) kernel some hours ago. Nothing broken (yet!).

Thank you, and here is hoping!  ;-)

							Thanx, Paul
Thomas Gleixner Oct. 24, 2023, 12:15 p.m. UTC | #123
Paul!

On Thu, Oct 19 2023 at 12:13, Paul E. McKenney wrote:
> On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
>> The important point is that at the very end the scheduler has the
>> ultimate power to say: "Not longer Mr. Nice Guy" without the risk of any
>> random damage due to the fact that preemption count is functional, which
>> makes your life easier as well as you admitted already. But that does
>> not mean you can eat the cake and still have it. :)
>
> Which is exactly why I need rcu_read_lock() to map to preempt_disable()
> and rcu_read_unlock() to preempt_enable().  ;-)

After reading back in the thread, I think we greatly talked past each
other mostly due to the different expectations and the resulting
dependencies which seem to be hardwired into our brains.

I'm pleading guilty as charged as I failed completely to read your
initial statement

 "The key thing to note is that from RCU's viewpoint, with this change,
  all kernels are preemptible, though rcu_read_lock() readers remain
  non-preemptible."

with that in mind and instead of dissecting it properly I committed the
fallacy of stating exactly the opposite, which obviously reflects only
the point of view I'm coming from.

With a fresh view, this turns out to be a complete non-problem because
there is no semantical dependency between the preemption model and the
RCU flavour.

The unified kernel preemption model has the following properties:

  1) It provides full preemptive multitasking.

  2) Preemptability is limited by implicit and explicit mechanisms.

  3) The ability to avoid overeager preemption for SCHED_OTHER tasks via
     the PREEMPT_LAZY mechanism.

     This emulates the NONE/VOLUNTARY preemption models which
     semantically provide collaborative multitasking.

     This emulation is not breaking the semantical properties of full
     preemptive multitasking because the scheduler still has the ability
     to enforce immediate preemption under consideration of #2.

     Which in turn is a prerequisite for removing the semantically
     ill-defined cond/might_resched() constructs.

The compile time selectable RCU flavour (preemptible/non-preemptible) is
not imposing a semantical change on this unified preemption model.

The selection of the RCU flavour is solely affecting the preemptability
(#2 above). Selecting non-preemptible RCU reduces preemptability by
adding an implicit restriction via mapping rcu_read_lock()
to preempt_disable().

IOW, the current upstream enforcement of RCU_PREEMPT=n when PREEMPTION=n
is only enforced by the lack of the full preempt counter in
PREEMPTION=n configs. Once the preemption counter is always enabled this
hardwired dependency goes away.

Even PREEMPT_DYNAMIC should just work with RCU_PREEMPT=n today because
with PREEMPT_DYNAMIC the preemption counter is unconditionally
available.

So that makes these hardwired dependencies go away in practice and
hopefully soon from our mental models too :)

RT will keep its hard dependency on RCU_PREEMPT in the same way it
depends hard on forced interrupt threading and other minor details to
enable the spinlock substitution.

>> That said, I completely understand your worries about the consequences,
>> but please take the step back and look at it from a conceptual point of
>> view.
>
> Conceptual point of view?  That sounds suspiciously academic.

Hehehe.

> Who are you and what did you do with the real Thomas Gleixner?  ;-)

The point I'm trying to make is not really academic, it comes from a
very practical point of view. As you know for almost two decades I'm
mostly busy with janitoring and mopping up the kernel.

A major takeaway from this eclectic experience is that there is a
tendency to implement very specialized solutions for different classes
of use cases.

The reasons to do so in the first place:

 1) Avoid breaking the existing and established solutions:

    E.g. the initial separation of x86_64 and i386

 2) Enforcement due to dependencies on mechanisms, which are
    considered "harmful" for particular use cases

    E.g. Preemptible RCU, which is separate also due to #1

 3) Because we can and something is sooo special

    You probably remember the full day we both spent in a room with SoC
    people to make them understand that their SoCs are not so special at
    all. :)

So there are perfectly valid reasons (#1, #2) to separate things, but we
really need to go back from time to time and think hard about the
question whether a particular separation is still justified. This is
especially true when dependencies or prerequisites change.

But in many cases we just keep going, take the separation as set in
stone forever and add features and workarounds on all ends without
rethinking whether we could unify these things for the better. The real
bad thing about this is that the more we add to the separation the
harder consolidation or unification becomes.

Granted that my initial take of consolidating on preemptible RCU might
be too brisk or too naive, but I still think that with the prospect of
an unified preemption model it's at least worth to have a very close
look at this question.

Not asking such questions or dismissing them upfront is a real danger
for the long term sustainability and maintainability of the kernel in my
opinion. Especially when the few people who actively "janitor" these
things are massively outnumbered by people who indulge in
specialization. :)

That said, the real Thomas Gleixner and his grumpy self are still there,
just slightly tired of handling the slurry brush all day long :)

Thanks,

        tglx
Steven Rostedt Oct. 24, 2023, 2:34 p.m. UTC | #124
On Tue, 19 Sep 2023 01:42:03 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

>    2) When the scheduler wants to set NEED_RESCHED due it sets
>       NEED_RESCHED_LAZY instead which is only evaluated in the return to
>       user space preemption points.
> 
>       As NEED_RESCHED_LAZY is not folded into the preemption count the
>       preemption count won't become zero, so the task can continue until
>       it hits return to user space.
> 
>       That preserves the existing behaviour.

I'm looking into extending this concept to user space and to VMs.

I'm calling this the "extended scheduler time slice" (ESTS pronounced "estis")

The idea is this: have VMs/user space share a memory region with the
kernel that is per thread/vCPU. This would be registered via a syscall or
ioctl on some defined file or whatever. Then, when entering user space /
VM, if NEED_RESCHED_LAZY (or whatever it's eventually called) is set, it
checks if the thread has this memory region and a special bit in it is
set, and if it does, it does not schedule. It will treat it like a long
kernel system call.

The kernel will then set another bit in the shared memory region that will
tell user space / VM that the kernel wanted to schedule, but is allowing it
to finish its critical section. When user space / VM is done with the
critical section, it will check the bit that may be set by the kernel and
if it is set, it should do a sched_yield() or VMEXIT so that the kernel can
now schedule it.

What about DOS you say? It's no different than running a long system call.
No task can run forever. It's not a "preempt disable", it's just "give me
some more time". A "NEED_RESCHED" will always schedule, just like a kernel
system call that takes a long time. The goal is to allow user space to get
out of critical sections that we know can cause problems if they get
preempted. Usually it's a user space / VM lock is held or maybe a VM
interrupt handler that needs to wake up a task on another vCPU.

If we are worried about abuse, we could even punish tasks that don't call
sched_yield() by the time their extended time slice is taken. Even without
that punishment, if we have EEVDF, this extension will make it less
eligible the next time around.

The goal is to prevent a thread / vCPU being preempted while holding a lock
or resource that other threads / vCPUs will want. That is, prevent
contention, as that's usually the biggest issue with performance in user
space and VMs.

I'm going to work on a POC, and see if I can get some benchmarks on how
much this could help tasks like databases and VMs in general.

-- Steve
Paul E. McKenney Oct. 24, 2023, 6:59 p.m. UTC | #125
On Tue, Oct 24, 2023 at 02:15:25PM +0200, Thomas Gleixner wrote:
> Paul!
> 
> On Thu, Oct 19 2023 at 12:13, Paul E. McKenney wrote:
> > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote:
> >> The important point is that at the very end the scheduler has the
> >> ultimate power to say: "Not longer Mr. Nice Guy" without the risk of any
> >> random damage due to the fact that preemption count is functional, which
> >> makes your life easier as well as you admitted already. But that does
> >> not mean you can eat the cake and still have it. :)
> >
> > Which is exactly why I need rcu_read_lock() to map to preempt_disable()
> > and rcu_read_unlock() to preempt_enable().  ;-)
> 
> After reading back in the thread, I think we greatly talked past each
> other mostly due to the different expectations and the resulting
> dependencies which seem to be hardwired into our brains.
> 
> I'm pleading guilty as charged as I failed completely to read your
> initial statement
> 
>  "The key thing to note is that from RCU's viewpoint, with this change,
>   all kernels are preemptible, though rcu_read_lock() readers remain
>   non-preemptible."
> 
> with that in mind and instead of dissecting it properly I committed the
> fallacy of stating exactly the opposite, which obviously reflects only
> the point of view I'm coming from.
> 
> With a fresh view, this turns out to be a complete non-problem because
> there is no semantical dependency between the preemption model and the
> RCU flavour.

Agreed, and been there and done that myself, as you well know!  ;-)

> The unified kernel preemption model has the following properties:
> 
>   1) It provides full preemptive multitasking.
> 
>   2) Preemptability is limited by implicit and explicit mechanisms.
> 
>   3) The ability to avoid overeager preemption for SCHED_OTHER tasks via
>      the PREEMPT_LAZY mechanism.
> 
>      This emulates the NONE/VOLUNTARY preemption models which
>      semantically provide collaborative multitasking.
> 
>      This emulation is not breaking the semantical properties of full
>      preemptive multitasking because the scheduler still has the ability
>      to enforce immediate preemption under consideration of #2.
> 
>      Which in turn is a prerequiste for removing the semantically
>      ill-defined cond/might_resched() constructs.
> 
> The compile time selectable RCU flavour (preemptible/non-preemptible) is
> not imposing a semantical change on this unified preemption model.
> 
> The selection of the RCU flavour is solely affecting the preemptability
> (#2 above). Selecting non-preemptible RCU reduces preemptability by
> adding an implicit restriction via mapping rcu_read_lock()
> to preempt_disable().
> 
> IOW, the current upstream enforcement of RCU_PREEMPT=n when PREEMPTION=n
> is only enforced by the the lack of the full preempt counter in
> PREEMPTION=n configs. Once the preemption counter is always enabled this
> hardwired dependency goes away.
> 
> Even PREEMPT_DYNAMIC should just work with RCU_PREEMPT=n today because
> with PREEMPT_DYNAMIC the preemption counter is unconditionally
> available.
> 
> So that makes these hardwired dependencies go away in practice and
> hopefully soon from our mental models too :)

The real reason for tying RCU_PREEMPT to PREEMPTION back in the day was
that there were no real-world uses of RCU_PREEMPT not matching PREEMPTION,
so those combinations were ruled out in order to reduce the number of
rcutorture scenarios.

But now it appears that we do have a use case for PREEMPTION=y and
RCU_PREEMPT=n, plus I have access to way more test hardware, so that
the additional rcutorture scenarios are less of a testing burden.

> RT will keep its hard dependency on RCU_PREEMPT in the same way it
> depends hard on forced interrupt threading and other minor details to
> enable the spinlock substitution.

"other minor details".  ;-)

Making PREEMPT_RT select RCU_PREEMPT makes sense to me!

> >> That said, I completely understand your worries about the consequences,
> >> but please take the step back and look at it from a conceptual point of
> >> view.
> >
> > Conceptual point of view?  That sounds suspiciously academic.
> 
> Hehehe.
> 
> > Who are you and what did you do with the real Thomas Gleixner?  ;-)
> 
> The point I'm trying to make is not really academic, it comes from a
> very practical point of view. As you know for almost two decades I'm
> mostly busy with janitoring and mopping up the kernel.
> 
> A major takeaway from this eclectic experience is that there is a
> tendency to implement very specialized solutions for different classes
> of use cases.
> 
> The reasons to do so in the first place:
> 
>  1) Avoid breaking the existing and established solutions:
> 
>     E.g. the initial separation of x8664 and i386
> 
>  2) Enforcement due to dependencies on mechanisms, which are
>     considered "harmful" for particular use cases
> 
>     E.g. Preemptible RCU, which is separate also due to #1
> 
>  3) Because we can and something is sooo special
> 
>     You probably remember the full day we both spent in a room with SoC
>     people to make them understand that their SoCs are not so special at
>     all. :)

   4) Because we don't see a use for a given combination, and we
      want to keep test time down to a dull roar, as noted above.

> So there are perfectly valid reasons (#1, #2) to separate things, but we
> really need to go back from time to time and think hard about the
> question whether a particular separation is still justified. This is
> especially true when dependencies or prerequisites change.
> 
> But in many cases we just keep going, take the separation as set in
> stone forever and add features and workarounds on all ends without
> rethinking whether we could unify these things for the better. The real
> bad thing about this is that the more we add to the separation the
> harder consolidation or unification becomes.
> 
> Granted that my initial take of consolidating on preemptible RCU might
> be too brisk or too naive, but I still think that with the prospect of
> an unified preemption model it's at least worth to have a very close
> look at this question.
> 
> Not asking such questions or dismissing them upfront is a real danger
> for the long term sustainability and maintainability of the kernel in my
> opinion. Especially when the few people who actively "janitor" these
> things are massively outnumbered by people who indulge in
> specialization. :)

Longer term, I do agree in principle with the notion of simplifying the
Linux-kernel RCU implementation by eliminating the PREEMPT_RCU=n code.
In near-term practice, here are the reasons for holding off on
this consolidation:

1.	Preemptible RCU needs more work for datacenter deployments,
	as mentioned earlier.  I also reiterate that if you only have
	a few thousand (or maybe even a few tens of thousand) servers,
	preemptible RCU will be just fine for you.  Give or take the
	safety criticality of your application.

2.	RCU priority boosting has not yet been really tested and tuned
	for systems that are adequately but not generously endowed with
	memory.  Boost too soon and you needlessly burn cycles and
	preempt important tasks.  Boost too late and it is OOM for you!

3.	To the best of my knowledge, the scheduler doesn't take memory
	footprint into account.  In particular, if a long-running RCU
	reader is preempted in a memory-naive fashion, all we gain
	is turning a potentially unimportant latency outlier into a
	definitely important OOM.

4.	There are probably a few gotchas that I haven't thought of or
	that I am forgetting.  More likely, more than a few.  As always!

But to your point, yes, these are things that we should be able to do
something about, given appropriate time and effort.  My guess is five
years, with the long pole being the reliability.  Preemptible RCU has
been gone through line by line recently, which is an extremely good
thing and an extremely welcome change from past practice, but that is
just a start.  That effort was getting people familiar with the code,
and should not be mistaken for a find-lots-of-bugs review session,
let alone a find-all-bugs review session.

> That said, the real Thomas Gleixner and his grumpy self are still there,
> just slightly tired of handling the slurry brush all day long :)

Whew!!!  Good to hear that the real Thomas Gleixner is still with us!!! ;-)

							Thanx, Paul
Steven Rostedt Oct. 25, 2023, 1:49 a.m. UTC | #126
On Tue, 24 Oct 2023 10:34:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I'm going to work on a POC, and see if I can get some benchmarks on how
> much this could help tasks like databases and VMs in general.

And that was much easier than I thought it would be. It also shows some
great results!

I started with Thomas's PREEMPT_AUTO.patch from the rt-devel tree:

 https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches

So you need to select:

  CONFIG_PREEMPT_AUTO

The below is my proof of concept patch. It still has debugging in it, and
I'm sure the interface will need to be changed.

There's now a new file:  /sys/kernel/extend_sched

Attached is a program that tests it. It mmaps that file, with:

 struct extend_map {
	unsigned long		flags;
 };

 static __thread struct extend_map *extend_map;

That is, there's this structure for every thread. It's assigned with:

	fd = open("/sys/kernel/extend_sched", O_RDWR);
	extend_map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

I don't actually like this interface, as it wastes a full page for just two
bits :-p

Anyway, to tell the kernel to "extend" the time slice if possible because
it's in a critical section, we have:

 static void extend(void)
 {
	if (!extend_map)
		return;

	extend_map->flags = 1;
 }

And to say that's it's done:

 static void unextend(void)
 {
	unsigned long prev;

	if (!extend_map)
		return;

	prev = xchg(&extend_map->flags, 0);
	if (prev & 2)
		sched_yield();
 }

So, bit 1 is for user space to tell the kernel "please extend me", and bit
2 is for the kernel to tell user space "OK, I extended you, but call
sched_yield() when done".

This test program creates 1 + (number of CPUs) threads that run in a loop
for 5 seconds. Each thread will grab a user space spin lock (not a futex,
but just shared memory). Before grabbing the lock it will call "extend()";
if it fails to grab the lock, it calls "unextend()" and spins on the lock
until it's free, where it will try again. Then after it gets the lock, it
will update a counter, and release the lock, calling "unextend()" as well.
Then it will spin on the counter until it increments again to allow another
task to get into the critical section.
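
Putting that description together with the extend()/unextend() helpers
above, each thread's loop could look something like the sketch below (the
lock, the counter, and the helper name are illustrative; the attached
extend-sched.c may differ in detail):

 static volatile unsigned long lock;	/* user space spin lock in shared memory */
 static volatile unsigned long counter;	/* protected by "lock" */

 /* One iteration of the loop that each thread runs for the 5 seconds. */
 static void do_iteration(void)
 {
	unsigned long last;

	for (;;) {
		extend();			/* ask for an extended time slice */
		if (!__sync_lock_test_and_set(&lock, 1))
			break;			/* got the lock */
		unextend();			/* lost the race, give the extension back */
		while (lock)			/* spin until the lock looks free */
			;
	}

	last = ++counter;			/* the critical section */

	__sync_lock_release(&lock);
	unextend();				/* may call sched_yield() here */

	while (counter == last)			/* wait for another task to get in */
		;
 }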

With the init of the extend_map disabled (so it doesn't use the extend
code), it ends with:

 Ran for 3908165 times
 Total wait time: 33.965654

I can give you stdev and all that too, but the above is pretty much the
same after several runs.

After enabling the extend code, it has:

 Ran for 4829340 times
 Total wait time: 32.635407

It was able to get into the critical section almost 1 million times more in
those 5 seconds! That's a 23% improvement!

The wait time for getting into the critical section also dropped by a
total of over a second (4% improvement).

I ran a traceeval tool on it (still a work in progress, but I can post it
when it's done) with the following trace and the writes to trace_marker
(tracefs_printf):

 trace-cmd record -e sched_switch ./extend-sched

It showed that without the extend, each task was preempted while holding
the lock around 200 times. With the extend, only one task was ever
preempted while holding the lock, and it only happened once!

Below is my patch (with debugging and on top of Thomas's PREEMPT_AUTO.patch):

Attached is the program I tested it with. It uses libtracefs to write to
the trace_marker file. To build it with libtracefs:

  gcc -o extend-sched extend-sched.c `pkg-config --libs --cflags libtracefs` -lpthread

If you don't want to build it with libtracefs, you can just do:

 grep -v tracefs extend-sched.c > extend-sched-notracefs.c

And build that.

-- Steve

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9b13b7d4f1d3..fb540dd0dec0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -740,6 +740,10 @@ struct kmap_ctrl {
 #endif
 };
 
+struct extend_map {
+	long				flags;
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -802,6 +806,8 @@ struct task_struct {
 	unsigned int			core_occupation;
 #endif
 
+	struct extend_map		*extend_map;
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index c1f706038637..21d0e4d81d33 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -147,17 +147,32 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 					    unsigned long ti_work)
 {
+	unsigned long ignore_mask;
+
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
 	 */
 	while (ti_work & EXIT_TO_USER_MODE_WORK) {
+		ignore_mask = 0;
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
+		if (ti_work & _TIF_NEED_RESCHED) {
 			schedule();
 
+		} else if (ti_work & _TIF_NEED_RESCHED_LAZY) {
+			if (!current->extend_map ||
+			    !(current->extend_map->flags & 1)) {
+				schedule();
+			} else {
+				trace_printk("Extend!\n");
+				/* Allow to leave with NEED_RESCHED_LAZY still set */
+				ignore_mask |= _TIF_NEED_RESCHED_LAZY;
+				current->extend_map->flags |= 2;
+			}
+		}
+
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
 
@@ -184,6 +199,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		tick_nohz_user_enter_prepare();
 
 		ti_work = read_thread_flags();
+		ti_work &= ~ignore_mask;
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
diff --git a/kernel/exit.c b/kernel/exit.c
index edb50b4c9972..ddf89ec9ab62 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -906,6 +906,13 @@ void __noreturn do_exit(long code)
 	if (tsk->io_context)
 		exit_io_context(tsk);
 
+	if (tsk->extend_map) {
+		unsigned long addr = (unsigned long)tsk->extend_map;
+
+		virt_to_page(addr)->mapping = NULL;
+		free_page(addr);
+	}
+
 	if (tsk->splice_pipe)
 		free_pipe_info(tsk->splice_pipe);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..da2214082d25 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1166,6 +1166,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->wake_q.next = NULL;
 	tsk->worker_private = NULL;
 
+	tsk->extend_map = NULL;
+
 	kcov_task_init(tsk);
 	kmsan_task_create(tsk);
 	kmap_local_fork(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 976092b7bd45..297061cfa08d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -32,3 +32,4 @@ obj-y += core.o
 obj-y += fair.o
 obj-y += build_policy.o
 obj-y += build_utility.o
+obj-y += extend.o
diff --git a/kernel/sched/extend.c b/kernel/sched/extend.c
new file mode 100644
index 000000000000..a632e1a8f57b
--- /dev/null
+++ b/kernel/sched/extend.c
@@ -0,0 +1,90 @@
+#include <linux/kobject.h>
+#include <linux/pagemap.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+
+#ifdef CONFIG_SYSFS
+static ssize_t extend_sched_read(struct file *file,  struct kobject *kobj,
+				 struct bin_attribute *bin_attr,
+				 char *buf, loff_t off, size_t len)
+{
+	static const char output[] = "Extend scheduling time slice\n";
+
+	printk("%s:%d\n", __func__, __LINE__);
+	if (off >= sizeof(output))
+		return 0;
+
+	strscpy(buf, output + off, len);
+	return min((ssize_t)len, sizeof(output) - off - 1);
+}
+
+static ssize_t extend_sched_write(struct file *file, struct kobject *kobj,
+				  struct bin_attribute *bin_attr,
+				  char *buf, loff_t off, size_t len)
+{
+	printk("%s:%d\n", __func__, __LINE__);
+	return -EINVAL;
+}
+
+static vm_fault_t extend_sched_mmap_fault(struct vm_fault *vmf)
+{
+	vm_fault_t ret = VM_FAULT_SIGBUS;
+
+	trace_printk("%s:%d\n", __func__, __LINE__);
+	/* Only has one page */
+	if (vmf->pgoff || !current->extend_map)
+		return ret;
+
+	vmf->page = virt_to_page(current->extend_map);
+
+	get_page(vmf->page);
+	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+	vmf->page->index   = vmf->pgoff;
+
+	return 0;
+}
+
+static void extend_sched_mmap_open(struct vm_area_struct *vma)
+{
+	printk("%s:%d\n", __func__, __LINE__);
+	WARN_ON(!current->extend_map);
+}
+
+static const struct vm_operations_struct extend_sched_vmops = {
+	.open		= extend_sched_mmap_open,
+	.fault		= extend_sched_mmap_fault,
+};
+
+static int extend_sched_mmap(struct file *file, struct kobject *kobj,
+			     struct bin_attribute *attr,
+			     struct vm_area_struct *vma)
+{
+	if (current->extend_map)
+		return -EBUSY;
+
+	current->extend_map = page_to_virt(alloc_page(GFP_USER | __GFP_ZERO));
+	if (!current->extend_map)
+		return -ENOMEM;
+
+	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP | VM_MAYWRITE, 0);
+	vma->vm_ops = &extend_sched_vmops;
+
+	return 0;
+}
+
+static struct bin_attribute extend_sched_attr = {
+	.attr = {
+		.name = "extend_sched",
+		.mode = 0777,
+	},
+	.read = &extend_sched_read,
+	.write = &extend_sched_write,
+	.mmap = &extend_sched_mmap,
+};
+
+static __init int extend_init(void)
+{
+	return sysfs_create_bin_file(kernel_kobj, &extend_sched_attr);
+}
+late_initcall(extend_init);
+#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700b140ac1bb..17ca22e80384 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -993,9 +993,10 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool
 		resched_curr(rq);
 	} else {
 		/* Did the task ignore the lazy reschedule request? */
-		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
+		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) {
+			trace_printk("Force resched?\n");
 			resched_curr(rq);
-		else
+		} else
 			resched_curr_lazy(rq);
 	}
 	clear_buddies(cfs_rq, se);
Sergey Senozhatsky Oct. 26, 2023, 7:50 a.m. UTC | #127
On (23/10/24 10:34), Steven Rostedt wrote:
> On Tue, 19 Sep 2023 01:42:03 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> >    2) When the scheduler wants to set NEED_RESCHED, it sets
> >       NEED_RESCHED_LAZY instead, which is only evaluated in the return
> >       to user space preemption points.
> > 
> >       As NEED_RESCHED_LAZY is not folded into the preemption count the
> >       preemption count won't become zero, so the task can continue until
> >       it hits return to user space.
> > 
> >       That preserves the existing behaviour.
> 
> I'm looking into extending this concept to user space and to VMs.
> 
> I'm calling this the "extended scheduler time slice" (ESTS pronounced "estis")
> 
> The idea is this. Have VMs/user space share a memory region with the
> kernel that is per thread/vCPU. This would be registered via a syscall or
> ioctl on some defined file or whatever. Then, when entering user space /
> VM, if NEED_RESCHED_LAZY (or whatever it's eventually called) is set, it
> checks if the thread has this memory region and a special bit in it is
> set, and if it does, it does not schedule. It will treat it like a long
> kernel system call.
> 
> The kernel will then set another bit in the shared memory region that will
> tell user space / VM that the kernel wanted to schedule, but is allowing it
> to finish its critical section. When user space / VM is done with the
> critical section, it will check the bit that may be set by the kernel and
> if it is set, it should do a sched_yield() or VMEXIT so that the kernel can
> now schedule it.
> 
> What about DoS, you say? It's no different from running a long system call.
> No task can run forever. It's not a "preempt disable", it's just "give me
> some more time". A "NEED_RESCHED" will always schedule, just like a kernel
> system call that takes a long time. The goal is to allow user space to get
> out of critical sections that we know can cause problems if they get
> preempted. Usually it's a user space / VM lock is held or maybe a VM
> interrupt handler that needs to wake up a task on another vCPU.
> 
> If we are worried about abuse, we could even punish tasks that don't call
> sched_yield() by the time their extended time slice is taken. Even without
> that punishment, if we have EEVDF, this extension will make it less
> eligible the next time around.
> 
> The goal is to prevent a thread / vCPU being preempted while holding a lock
> or resource that other threads / vCPUs will want. That is, prevent
> contention, as that's usually the biggest issue with performance in user
> space and VMs.
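
For illustration, a minimal user-space sketch of the handshake described above
could look like this. The sysfs path follows the "extend_sched" bin attribute
in the PoC; the bit names and the layout of the shared page are assumptions --
the PoC does not define them here.

#include <fcntl.h>
#include <sched.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define ESTS_EXTEND	(1u << 0)	/* user space: "give me a bit more time" */
#define ESTS_YIELD	(1u << 1)	/* kernel: "yield once you are done" */

static volatile uint32_t *ests;

static int ests_init(void)
{
	void *map;
	int fd = open("/sys/kernel/extend_sched", O_RDWR);

	if (fd < 0)
		return -1;

	/* one page per thread, backed by current->extend_map in the PoC */
	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	ests = map;
	return 0;
}

static void crit_enter(void)		/* before taking a user-space lock */
{
	*ests |= ESTS_EXTEND;
}

static void crit_exit(void)		/* after dropping the lock */
{
	*ests &= ~ESTS_EXTEND;
	if (*ests & ESTS_YIELD) {	/* kernel deferred a reschedule for us */
		*ests &= ~ESTS_YIELD;
		sched_yield();		/* give the CPU back now */
	}
}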

I think some time ago we tried to check the guest's preempt count on each
vm-exit, and we'd vm-enter again if the guest had exited from a critical
section (one that bumps the preempt count), so that it could hopefully finish
whatever it was going to
do and vmexit again. We didn't look into covering guest's RCU read-side
critical sections.

Can you educate me, is your PoC significantly different from guest preempt
count check?
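
For illustration, a check along those lines might be shaped like the fragment
below. All names are hypothetical -- this is not the code that was actually
tried, just the general idea of it.

/*
 * The guest mirrors its preempt_count into a page shared with the host;
 * on vm-exit the host re-enters the guest once if it exited from inside
 * a preempt-disabled section, instead of yielding right away.
 */
struct vcpu_shared {
	u32 guest_preempt_count;	/* written by the guest */
};

static bool grant_one_reentry(struct vcpu_shared *shared, bool *granted)
{
	if (*granted || !READ_ONCE(shared->guest_preempt_count))
		return false;

	*granted = true;	/* at most one extension per reschedule request */
	return true;		/* caller vm-enters again instead of scheduling */
}
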
Steven Rostedt Oct. 26, 2023, 12:48 p.m. UTC | #128
On Thu, 26 Oct 2023 16:50:16 +0900
Sergey Senozhatsky <senozhatsky@chromium.org> wrote:

> > The goal is to prevent a thread / vCPU being preempted while holding a lock
> > or resource that other threads / vCPUs will want. That is, prevent
> > contention, as that's usually the biggest issue with performance in user
> > space and VMs.  
> 
> I think some time ago we tried to check the guest's preempt count on each
> vm-exit, and we'd vm-enter again if the guest had exited from a critical
> section (one that bumps the preempt count), so that it could hopefully finish
> whatever it was going to
> do and vmexit again. We didn't look into covering guest's RCU read-side
> critical sections.
> 
> Can you educate me, is your PoC significantly different from guest preempt
> count check?

No, it's probably very similar. Just the mechanism to allow it to run
longer may be different.

-- Steve

Patch

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index d63b02940747..fc6f4121b412 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -100,6 +100,7 @@  struct thread_info {
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
 #define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
+#define TIF_RESCHED_ALLOW	30	/* reschedule if needed */
 
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
@@ -122,6 +123,7 @@  struct thread_info {
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
 #define _TIF_ADDR32		(1 << TIF_ADDR32)
+#define _TIF_RESCHED_ALLOW	(1 << TIF_RESCHED_ALLOW)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 177b3f3676ef..4dd3d91d990f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2245,6 +2245,36 @@  static __always_inline bool need_resched(void)
 	return unlikely(tif_need_resched());
 }
 
+#ifdef TIF_RESCHED_ALLOW
+/*
+ * allow_resched() .. disallow_resched() demarcate a preemptible section.
+ *
+ * Used around primitives where it might not be convenient to periodically
+ * call cond_resched().
+ */
+static inline void allow_resched(void)
+{
+	might_sleep();
+	set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
+}
+
+static inline void disallow_resched(void)
+{
+	clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
+}
+
+static __always_inline bool resched_allowed(void)
+{
+	return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
+}
+
+#else
+static __always_inline bool resched_allowed(void)
+{
+	return false;
+}
+#endif /* TIF_RESCHED_ALLOW */
+
 /*
  * Wrappers for p->thread_info->cpu access. No-op on UP.
  */
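
A minimal usage sketch, for illustration only -- the function name and the
memset() stand-in are hypothetical, not taken from this series:

/*
 * Mark a long-running primitive as preemptible on preempt_model_none()
 * or preempt_model_voluntary() kernels: with TIF_RESCHED_ALLOW set, a
 * pending reschedule can be acted on from interrupt return.  Nothing in
 * this window may disable preemption or take a spinlock on
 * PREEMPT_COUNT=n builds.
 */
static void zero_pages_long(void *addr, unsigned long npages)
{
	allow_resched();
	memset(addr, 0, npages * PAGE_SIZE);	/* e.g. ends up as REP; STOSB */
	disallow_resched();
}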