diff mbox series

init/main: Clear boot task idle flag

Message ID 20230913005647.1534747-1-Liam.Howlett@oracle.com (mailing list archive)
State New
Headers show
Series init/main: Clear boot task idle flag | expand

Commit Message

Liam R. Howlett Sept. 13, 2023, 12:56 a.m. UTC
Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle().  Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set.  Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed.  Recent changes
have caused just this scenario and IRQs have been enabled early.

This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.

Fix this issue by clearing the PF_IDLE flag on return from sched_init()
and restore the flag in rest_init().  Although the boot task was marked
as idle since (at least) d80e4fda576d, I am not sure that it is wrong to
do so.  The forced context-switch on idle task was introduced in the
tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee.

Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Cc: stable@vger.kernel.org
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 init/main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra Sept. 13, 2023, 11:01 a.m. UTC | #1
On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote:
> Initial booting is setting the task flag to idle (PF_IDLE) by the call
> path sched_init() -> init_idle().  Having the task idle and calling
> call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
> set.  Subsequent calls to any cond_resched() will enable IRQs,
> potentially earlier than the IRQ setup has completed.  Recent changes
> have caused just this scenario and IRQs have been enabled early.
> 
> This causes a warning later in start_kernel() as interrupts are enabled
> before they are fully set up.
> 
> Fix this issue by clearing the PF_IDLE flag on return from sched_init()
> and restore the flag in rest_init().  Although the boot task was marked
> as idle since (at least) d80e4fda576d, I am not sure that it is wrong to
> do so.  The forced context-switch on idle task was introduced in the
> tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee.
> 
> Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/
> Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
> Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
> Cc: stable@vger.kernel.org
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> Cc: Andreas Schwab <schwab@linux-m68k.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peng Zhang <zhangpeng.00@bytedance.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
>  init/main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/init/main.c b/init/main.c
> index ad920fac325c..f74772acf612 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void)
>  	 */
>  	rcu_read_lock();
>  	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
> -	tsk->flags |= PF_NO_SETAFFINITY;
> +	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
>  	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
>  	rcu_read_unlock();
>  
> @@ -938,6 +938,8 @@ void start_kernel(void)
>  	 * time - but meanwhile we still have a functioning scheduler.
>  	 */
>  	sched_init();
> +	/* Avoid early context switch, rest_init() restores PF_IDLE */
> +	current->flags &= ~PF_IDLE;
>  
>  	if (WARN(!irqs_disabled(),
>  		 "Interrupts were enabled *very* early, fixing it\n"))


Hurmph... so since this is about IRQs, would it not make sense to have
the | PF_IDLE near 'early_boot_irqs_disabled = false' ?

Or, alternatively, make the tinyrcu thing check that variable?
Paul E. McKenney Sept. 13, 2023, 11:28 a.m. UTC | #2
On Wed, Sep 13, 2023 at 01:01:39PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote:
> > Initial booting is setting the task flag to idle (PF_IDLE) by the call
> > path sched_init() -> init_idle().  Having the task idle and calling
> > call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
> > set.  Subsequent calls to any cond_resched() will enable IRQs,
> > potentially earlier than the IRQ setup has completed.  Recent changes
> > have caused just this scenario and IRQs have been enabled early.
> > 
> > This causes a warning later in start_kernel() as interrupts are enabled
> > before they are fully set up.
> > 
> > Fix this issue by clearing the PF_IDLE flag on return from sched_init()
> > and restore the flag in rest_init().  Although the boot task was marked
> > as idle since (at least) d80e4fda576d, I am not sure that it is wrong to
> > do so.  The forced context-switch on idle task was introduced in the
> > tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee.
> > 
> > Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/
> > Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
> > Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
> > Cc: stable@vger.kernel.org
> > Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> > Cc: Andreas Schwab <schwab@linux-m68k.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Peng Zhang <zhangpeng.00@bytedance.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > ---
> >  init/main.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/init/main.c b/init/main.c
> > index ad920fac325c..f74772acf612 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void)
> >  	 */
> >  	rcu_read_lock();
> >  	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
> > -	tsk->flags |= PF_NO_SETAFFINITY;
> > +	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
> >  	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
> >  	rcu_read_unlock();
> >  
> > @@ -938,6 +938,8 @@ void start_kernel(void)
> >  	 * time - but meanwhile we still have a functioning scheduler.
> >  	 */
> >  	sched_init();
> > +	/* Avoid early context switch, rest_init() restores PF_IDLE */
> > +	current->flags &= ~PF_IDLE;
> >  
> >  	if (WARN(!irqs_disabled(),
> >  		 "Interrupts were enabled *very* early, fixing it\n"))
> 
> Hurmph... so since this is about IRQs, would it not make sense to have
> the | PF_IDLE near 'early_boot_irqs_disabled = false' ?
> 
> Or, alternatively, make the tinyrcu thing check that variable?

We could do that, but do we really the decidedly non-idle early boot
sequence designated as idle?

What surprises me is that this is just now showing up.  The ingredients
for this one have been in place for almost 10 years.

							Thanx, Paul
Geert Uytterhoeven Sept. 13, 2023, 12:58 p.m. UTC | #3
Hi Liam,

On Wed, Sep 13, 2023 at 2:57 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> Initial booting is setting the task flag to idle (PF_IDLE) by the call
> path sched_init() -> init_idle().  Having the task idle and calling
> call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
> set.  Subsequent calls to any cond_resched() will enable IRQs,
> potentially earlier than the IRQ setup has completed.  Recent changes
> have caused just this scenario and IRQs have been enabled early.
>
> This causes a warning later in start_kernel() as interrupts are enabled
> before they are fully set up.
>
> Fix this issue by clearing the PF_IDLE flag on return from sched_init()
> and restore the flag in rest_init().  Although the boot task was marked
> as idle since (at least) d80e4fda576d, I am not sure that it is wrong to
> do so.  The forced context-switch on idle task was introduced in the
> tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee.
>
> Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/
> Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
> Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
> Cc: stable@vger.kernel.org
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> Cc: Andreas Schwab <schwab@linux-m68k.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peng Zhang <zhangpeng.00@bytedance.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

Thanks for your patch!

This fixes the

      WARNING: CPU: 0 PID: 0 at init/main.c:992 start_kernel+0x2f0/0x480

I was seeing during boot on Renesas RZ/A1 and RZ/A2 since commit
cfeb6ae8bcb96ccf ("maple_tree: disable mas_wr_append() when other
readers are possible") in v6.5.

However, this causes lots of new warnings on e.g. R-Car M2-W, which
wasn't affected before, like:

+BUG: sleeping function called from invalid context at
kernel/sched/completion.c:111
+in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
+preempt_count: 0, expected: 0
+no locks held by swapper/0/1.
+CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W
6.6.0-rc1-koelsch-02359-g9a5bed6a8bd8 #1864
+Hardware name: Generic R-Car Gen2 (Flattened Device Tree)
+ unwind_backtrace from show_stack+0x10/0x14
+ show_stack from dump_stack_lvl+0x68/0x90
+ dump_stack_lvl from __might_resched+0x1b0/0x22c
+ __might_resched from __wait_for_common+0x38/0x170
+ __wait_for_common from kernel_init+0x10/0x130
+ kernel_init from ret_from_fork+0x14/0x30
+Exception stack(0xf0815fb0 to 0xf0815ff8)
+5fa0:                                     00000000 00000000 00000000 00000000
+5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
+5fe0: 00000000 00000000 00000000 00000000 00000013 00000000

and

+BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:580
+in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: systemd
+preempt_count: 0, expected: 0
+no locks held by systemd/1.
+CPU: 1 PID: 1 Comm: systemd Tainted: G        W        N
6.6.0-rc1-koelsch-02359-g9a5bed6a8bd8 #1864
+Hardware name: Generic R-Car Gen2 (Flattened Device Tree)
+ unwind_backtrace from show_stack+0x10/0x14
+ show_stack from dump_stack_lvl+0x68/0x90
+ dump_stack_lvl from __might_resched+0x1b0/0x22c
+ __might_resched from __mutex_lock+0x34/0x3a8
+ __mutex_lock from mutex_lock_nested+0x18/0x20
+ mutex_lock_nested from do_epoll_wait+0x220/0x5ac
+ do_epoll_wait from sys_epoll_wait+0x44/0x6c
+ sys_epoll_wait from ret_fast_syscall+0x0/0x1c
+Exception stack(0xf0815fa8 to 0xf0815ff0)
+5fa0:                   beff75e0 00000004 00000004 beff75e0 0000003b ffffffff
+5fc0: beff75e0 00000004 ffffffff 000000fc 00000000 ffffffff 0000003b 00000004
+5fe0: 000000fc beff75b8 b6ef8511 b6e75786

Gr{oetje,eeting}s,

                        Geert


--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Liam R. Howlett Sept. 13, 2023, 1:18 p.m. UTC | #4
* Paul E. McKenney <paulmck@kernel.org> [230913 07:28]:
> On Wed, Sep 13, 2023 at 01:01:39PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote:
> > > Initial booting is setting the task flag to idle (PF_IDLE) by the call
> > > path sched_init() -> init_idle().  Having the task idle and calling
> > > call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
> > > set.  Subsequent calls to any cond_resched() will enable IRQs,
> > > potentially earlier than the IRQ setup has completed.  Recent changes
> > > have caused just this scenario and IRQs have been enabled early.
> > > 
> > > This causes a warning later in start_kernel() as interrupts are enabled
> > > before they are fully set up.
> > > 
> > > Fix this issue by clearing the PF_IDLE flag on return from sched_init()
> > > and restore the flag in rest_init().  Although the boot task was marked
> > > as idle since (at least) d80e4fda576d, I am not sure that it is wrong to
> > > do so.  The forced context-switch on idle task was introduced in the
> > > tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee.
> > > 
> > > Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/
> > > Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
> > > Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
> > > Cc: stable@vger.kernel.org
> > > Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> > > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > > Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> > > Cc: Andreas Schwab <schwab@linux-m68k.org>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Cc: Peng Zhang <zhangpeng.00@bytedance.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Juri Lelli <juri.lelli@redhat.com>
> > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > ---
> > >  init/main.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/init/main.c b/init/main.c
> > > index ad920fac325c..f74772acf612 100644
> > > --- a/init/main.c
> > > +++ b/init/main.c
> > > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void)
> > >  	 */
> > >  	rcu_read_lock();
> > >  	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
> > > -	tsk->flags |= PF_NO_SETAFFINITY;
> > > +	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
> > >  	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
> > >  	rcu_read_unlock();
> > >  
> > > @@ -938,6 +938,8 @@ void start_kernel(void)
> > >  	 * time - but meanwhile we still have a functioning scheduler.
> > >  	 */
> > >  	sched_init();
> > > +	/* Avoid early context switch, rest_init() restores PF_IDLE */
> > > +	current->flags &= ~PF_IDLE;
> > >  
> > >  	if (WARN(!irqs_disabled(),
> > >  		 "Interrupts were enabled *very* early, fixing it\n"))
> > 
> > Hurmph... so since this is about IRQs, would it not make sense to have
> > the | PF_IDLE near 'early_boot_irqs_disabled = false' ?

I was thinking that this isn't an idle thread until the end of boot, so
leave it set as not idle until the end.

> > 
> > Or, alternatively, make the tinyrcu thing check that variable?
> 
> We could do that, but do we really the decidedly non-idle early boot
> sequence designated as idle?

call_rcu() tiny is called more than this code, so unless there is
another reason we want to check the IRQ status it seemed best to change
the boot task flag.  I mean, I think the is_idle_task() check is valid
in every other context, right?

> 
> What surprises me is that this is just now showing up.  The ingredients
> for this one have been in place for almost 10 years.
> 

Am I just that lucky?

Cheers,
Liam
Peter Zijlstra Sept. 13, 2023, 1:52 p.m. UTC | #5
On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote:

> diff --git a/init/main.c b/init/main.c
> index ad920fac325c..f74772acf612 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void)
>  	 */
>  	rcu_read_lock();
>  	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
> -	tsk->flags |= PF_NO_SETAFFINITY;
> +	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
>  	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
>  	rcu_read_unlock();
>  

Hmm, isn't that pid-1 you're setting PF_IDLE on?

The task becoming idle is 'current' at this point, see the
cpu_startup_entry() call below.

Would not something like so be the right thing?


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2299a5cfbfb9..802551e0009b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
 	 * PF_KTHREAD should already be set at this point; regardless, make it
 	 * look like a proper per-CPU kthread.
 	 */
-	idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
+	idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY;
 	kthread_set_per_cpu(idle, cpu);
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 342f58a329f5..5007b25c5bc6 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,6 +373,7 @@ EXPORT_SYMBOL_GPL(play_idle_precise);
 
 void cpu_startup_entry(enum cpuhp_state state)
 {
+	current->flags |= PF_IDLE;
 	arch_cpu_idle_prepare();
 	cpuhp_online_idle(state);
 	while (1)
Liam R. Howlett Sept. 13, 2023, 2:51 p.m. UTC | #6
* Peter Zijlstra <peterz@infradead.org> [230913 09:53]:
> On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote:
> 
> > diff --git a/init/main.c b/init/main.c
> > index ad920fac325c..f74772acf612 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void)
> >  	 */
> >  	rcu_read_lock();
> >  	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
> > -	tsk->flags |= PF_NO_SETAFFINITY;
> > +	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
> >  	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
> >  	rcu_read_unlock();
> >  
> 
> Hmm, isn't that pid-1 you're setting PF_IDLE on?

Yes, thanks.  I think that is what Geert is hitting with my patch.

debug __might_resched() in kernel/sched/core.c is failing to return in
that first (complex) if statement.  His report says pid 1 so this is
likely the issue.

> 
> The task becoming idle is 'current' at this point, see the
> cpu_startup_entry() call below.
> 
> Would not something like so be the right thing?
> 
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2299a5cfbfb9..802551e0009b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
>  	 * PF_KTHREAD should already be set at this point; regardless, make it
>  	 * look like a proper per-CPU kthread.
>  	 */
> -	idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
> +	idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY;

I am concerned this will alter more than just the current task, which
would mean more modifications later.  There is a comment about it being
called 'more than once' and 'per cpu' so I am hesitant to change the
function itself.

Although I am unsure of the call path.. fork_idle() -> init_idle() I
guess?

>  	kthread_set_per_cpu(idle, cpu);
>  
>  #ifdef CONFIG_SMP
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 342f58a329f5..5007b25c5bc6 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -373,6 +373,7 @@ EXPORT_SYMBOL_GPL(play_idle_precise);
>  
>  void cpu_startup_entry(enum cpuhp_state state)
>  {
> +	current->flags |= PF_IDLE;
>  	arch_cpu_idle_prepare();
>  	cpuhp_online_idle(state);
>  	while (1)
Peter Zijlstra Sept. 13, 2023, 4:12 p.m. UTC | #7
On Wed, Sep 13, 2023 at 10:51:25AM -0400, Liam R. Howlett wrote:
> * Peter Zijlstra <peterz@infradead.org> [230913 09:53]:
> > On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote:
> > 
> > > diff --git a/init/main.c b/init/main.c
> > > index ad920fac325c..f74772acf612 100644
> > > --- a/init/main.c
> > > +++ b/init/main.c
> > > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void)
> > >  	 */
> > >  	rcu_read_lock();
> > >  	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
> > > -	tsk->flags |= PF_NO_SETAFFINITY;
> > > +	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
> > >  	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
> > >  	rcu_read_unlock();
> > >  
> > 
> > Hmm, isn't that pid-1 you're setting PF_IDLE on?
> 
> Yes, thanks.  I think that is what Geert is hitting with my patch.
> 
> debug __might_resched() in kernel/sched/core.c is failing to return in
> that first (complex) if statement.  His report says pid 1 so this is
> likely the issue.
> 
> > 
> > The task becoming idle is 'current' at this point, see the
> > cpu_startup_entry() call below.
> > 
> > Would not something like so be the right thing?
> > 
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2299a5cfbfb9..802551e0009b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
> >  	 * PF_KTHREAD should already be set at this point; regardless, make it
> >  	 * look like a proper per-CPU kthread.
> >  	 */
> > -	idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
> > +	idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY;
> 
> I am concerned this will alter more than just the current task, which
> would mean more modifications later.  There is a comment about it being
> called 'more than once' and 'per cpu' so I am hesitant to change the
> function itself.
> 
> Although I am unsure of the call path.. fork_idle() -> init_idle() I
> guess?

There's only 2 ways to get into do_idle(), through cpu_startup_entry()
and play_idle_precise(). The latter already frobs PF_IDLE since it is
the forced idle path, this then leaves cpu_startup_entry() which is the
regular idle path.

All idle threads will end up calling into it, the boot CPU through the
rest_init() and the SMP cpus through arch SMP bringup.

IOW, this ensures all idle loops will have PF_IDLE set but not the
pre-idle loop setup code these threads run.
Liam R. Howlett Sept. 13, 2023, 5:32 p.m. UTC | #8
* Peter Zijlstra <peterz@infradead.org> [230913 12:13]:
> On Wed, Sep 13, 2023 at 10:51:25AM -0400, Liam R. Howlett wrote:
> > * Peter Zijlstra <peterz@infradead.org> [230913 09:53]:
> > > On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote:
> > > 
> > > > diff --git a/init/main.c b/init/main.c
> > > > index ad920fac325c..f74772acf612 100644
> > > > --- a/init/main.c
> > > > +++ b/init/main.c
> > > > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void)
> > > >  	 */
> > > >  	rcu_read_lock();
> > > >  	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
> > > > -	tsk->flags |= PF_NO_SETAFFINITY;
> > > > +	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
> > > >  	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
> > > >  	rcu_read_unlock();
> > > >  
> > > 
> > > Hmm, isn't that pid-1 you're setting PF_IDLE on?
> > 
> > Yes, thanks.  I think that is what Geert is hitting with my patch.
> > 
> > debug __might_resched() in kernel/sched/core.c is failing to return in
> > that first (complex) if statement.  His report says pid 1 so this is
> > likely the issue.
> > 
> > > 
> > > The task becoming idle is 'current' at this point, see the
> > > cpu_startup_entry() call below.
> > > 
> > > Would not something like so be the right thing?
> > > 
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 2299a5cfbfb9..802551e0009b 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
> > >  	 * PF_KTHREAD should already be set at this point; regardless, make it
> > >  	 * look like a proper per-CPU kthread.
> > >  	 */
> > > -	idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
> > > +	idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY;
> > 
> > I am concerned this will alter more than just the current task, which
> > would mean more modifications later.  There is a comment about it being
> > called 'more than once' and 'per cpu' so I am hesitant to change the
> > function itself.
> > 
> > Although I am unsure of the call path.. fork_idle() -> init_idle() I
> > guess?
> 
> There's only 2 ways to get into do_idle(), through cpu_startup_entry()
> and play_idle_precise(). The latter already frobs PF_IDLE since it is
> the forced idle path, this then leaves cpu_startup_entry() which is the
> regular idle path.
> 
> All idle threads will end up calling into it, the boot CPU through the
> rest_init() and the SMP cpus through arch SMP bringup.
> 
> IOW, this ensures all idle loops will have PF_IDLE set but not the
> pre-idle loop setup code these threads run.

Thanks for the information.  This does leave the init_idle() function in
the odd state of not setting PF_IDLE, but I guess that's okay?
Peter Zijlstra Sept. 14, 2023, 7:13 a.m. UTC | #9
On Wed, Sep 13, 2023 at 01:32:38PM -0400, Liam R. Howlett wrote:
> * Peter Zijlstra <peterz@infradead.org> [230913 12:13]:
> > On Wed, Sep 13, 2023 at 10:51:25AM -0400, Liam R. Howlett wrote:
> > > * Peter Zijlstra <peterz@infradead.org> [230913 09:53]:
> > > > On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote:
> > > > 
> > > > > diff --git a/init/main.c b/init/main.c
> > > > > index ad920fac325c..f74772acf612 100644
> > > > > --- a/init/main.c
> > > > > +++ b/init/main.c
> > > > > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void)
> > > > >  	 */
> > > > >  	rcu_read_lock();
> > > > >  	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
> > > > > -	tsk->flags |= PF_NO_SETAFFINITY;
> > > > > +	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
> > > > >  	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
> > > > >  	rcu_read_unlock();
> > > > >  
> > > > 
> > > > Hmm, isn't that pid-1 you're setting PF_IDLE on?
> > > 
> > > Yes, thanks.  I think that is what Geert is hitting with my patch.
> > > 
> > > debug __might_resched() in kernel/sched/core.c is failing to return in
> > > that first (complex) if statement.  His report says pid 1 so this is
> > > likely the issue.
> > > 
> > > > 
> > > > The task becoming idle is 'current' at this point, see the
> > > > cpu_startup_entry() call below.
> > > > 
> > > > Would not something like so be the right thing?
> > > > 
> > > > 
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 2299a5cfbfb9..802551e0009b 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
> > > >  	 * PF_KTHREAD should already be set at this point; regardless, make it
> > > >  	 * look like a proper per-CPU kthread.
> > > >  	 */
> > > > -	idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
> > > > +	idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY;
> > > 
> > > I am concerned this will alter more than just the current task, which
> > > would mean more modifications later.  There is a comment about it being
> > > called 'more than once' and 'per cpu' so I am hesitant to change the
> > > function itself.
> > > 
> > > Although I am unsure of the call path.. fork_idle() -> init_idle() I
> > > guess?
> > 
> > There's only 2 ways to get into do_idle(), through cpu_startup_entry()
> > and play_idle_precise(). The latter already frobs PF_IDLE since it is
> > the forced idle path, this then leaves cpu_startup_entry() which is the
> > regular idle path.
> > 
> > All idle threads will end up calling into it, the boot CPU through the
> > rest_init() and the SMP cpus through arch SMP bringup.
> > 
> > IOW, this ensures all idle loops will have PF_IDLE set but not the
> > pre-idle loop setup code these threads run.
> 
> Thanks for the information.  This does leave the init_idle() function in
> the odd state of not setting PF_IDLE, but I guess that's okay?

Yep, the few things that care about PF_IDLE seem to really only care
about do_idle() and very much not (per the rcutiny thing) any code that
comes before it.
Oliver Sang Sept. 14, 2023, 4:05 p.m. UTC | #10
Hello,

kernel test robot noticed "BUG:sleeping_function_called_from_invalid_context_at_kernel/sched/completion.c" on:

commit: afa19b70494c22ba27224f90c0805e88e5db41e1 ("[PATCH] init/main: Clear boot task idle flag")
url: https://github.com/intel-lab-lkp/linux/commits/Liam-R-Howlett/init-main-Clear-boot-task-idle-flag/20230913-085921
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 3669558bdf354cd352be955ef2764cde6a9bf5ec
patch link: https://lore.kernel.org/all/20230913005647.1534747-1-Liam.Howlett@oracle.com/
patch subject: [PATCH] init/main: Clear boot task idle flag

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


besides BUG:sleeping_function_called_from_invalid_context_at_kernel/sched/completion.c,
we also observed other issues on this commit but clean on parent:


3669558bdf354cd3 afa19b70494c22ba27224f90c08
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :6           17%           1:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_drivers/dma-buf/selftest.c
           :6           17%           1:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_drivers/net/ethernet/intel/e1000_hw.c
           :6          100%           6:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_include/linux/sched/mm.h
           :6           83%           5:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/mutex.c
           :6           50%           3:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/rwsem.c
           :6           17%           1:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/rcu/tree.c
           :6          100%           6:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/sched/completion.c
           :6          100%           6:6     dmesg.RIP:__do_softirq
           :6          100%           6:6     dmesg.RIP:__flush_smp_call_function_queue
           :6           33%           2:6     dmesg.RIP:__handle_irq_event_percpu
           :6          100%           6:6     dmesg.RIP:__hrtimer_run_queues
           :6          100%           6:6     dmesg.RIP:__note_gp_changes
           :6          100%           6:6     dmesg.RIP:__queue_work
           :6          100%           6:6     dmesg.RIP:__raise_softirq_irqoff
           :6          100%           6:6     dmesg.RIP:__run_timers
           :6          100%           6:6     dmesg.RIP:__smp_call_single_queue
           :6          100%           6:6     dmesg.RIP:__sysvec_apic_timer_interrupt
           :6          100%           6:6     dmesg.RIP:__sysvec_call_function_single
           :6           50%           3:6     dmesg.RIP:__update_load_avg_blocked_se
           :6          100%           6:6     dmesg.RIP:__update_load_avg_cfs_rq
           :6          100%           6:6     dmesg.RIP:__update_load_avg_se
           :6          100%           6:6     dmesg.RIP:__wake_nocb_gp
           :6           67%           4:6     dmesg.RIP:_raw_spin_unlock_irqrestore
           :6          100%           6:6     dmesg.RIP:arch_ftrace_ops_list_func
           :6          100%           6:6     dmesg.RIP:attach_entity_load_avg
           :6          100%           6:6     dmesg.RIP:call_timer_fn
           :6          100%           6:6     dmesg.RIP:default_idle
           :6          100%           6:6     dmesg.RIP:dequeue_task_fair
           :6          100%           6:6     dmesg.RIP:detach_entity_load_avg
           :6           33%           2:6     dmesg.RIP:detach_if_pending
           :6           67%           4:6     dmesg.RIP:dma_fence_signal_timestamp_locked
           :6           33%           2:6     dmesg.RIP:do_nocb_deferred_wakeup
           :6          100%           6:6     dmesg.RIP:do_nocb_deferred_wakeup_timer
           :6          100%           6:6     dmesg.RIP:enqueue_hrtimer
           :6          100%           6:6     dmesg.RIP:enqueue_task_fair
           :6          100%           6:6     dmesg.RIP:enqueue_timer
           :6           17%           1:6     dmesg.RIP:hrtimer_init
           :6           67%           4:6     dmesg.RIP:load_balance
           :6          100%           6:6     dmesg.RIP:lock_acquire
           :6          100%           6:6     dmesg.RIP:lock_acquired
           :6          100%           6:6     dmesg.RIP:lock_contended
           :6          100%           6:6     dmesg.RIP:lock_release
           :6           33%           2:6     dmesg.RIP:printk_sprint
           :6           17%           1:6     dmesg.RIP:rcu_barrier_trace
           :6          100%           6:6     dmesg.RIP:rcu_core
           :6          100%           6:6     dmesg.RIP:rcu_is_cpu_rrupt_from_idle
           :6           83%           5:6     dmesg.RIP:rcu_qs
           :6          100%           6:6     dmesg.RIP:rcu_report_qs_rnp
           :6          100%           6:6     dmesg.RIP:rcu_sched_clock_irq
           :6           50%           3:6     dmesg.RIP:resched_curr
           :6          100%           6:6     dmesg.RIP:set_task_cpu
           :6          100%           6:6     dmesg.RIP:trace_hardirqs_off
           :6          100%           6:6     dmesg.RIP:trace_hardirqs_off_finish
           :6          100%           6:6     dmesg.RIP:trace_hardirqs_on
           :6          100%           6:6     dmesg.RIP:trace_reschedule_entry
           :6          100%           6:6     dmesg.RIP:trace_reschedule_exit
           :6          100%           6:6     dmesg.RIP:try_to_wake_up
           :6          100%           6:6     dmesg.RIP:ttwu_do_activate
           :6          100%           6:6     dmesg.RIP:update_cpu_capacity
           :6          100%           6:6     dmesg.RIP:update_curr
           :6          100%           6:6     dmesg.RIP:update_dl_rq_load_avg
           :6          100%           6:6     dmesg.RIP:update_irq_load_avg
           :6          100%           6:6     dmesg.RIP:update_rt_rq_load_avg
           :6          100%           6:6     dmesg.RIP:update_sd_lb_stats
           :6           17%           1:6     dmesg.RIP:wake_up_nohz_cpu
           :6          100%           6:6     dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#__sysvec_apic_timer_interrupt
           :6          100%           6:6     dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#__sysvec_call_function_single
           :6          100%           6:6     dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#trace_reschedule_entry
           :6          100%           6:6     dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#trace_reschedule_exit
           :6          100%           6:6     dmesg.WARNING:at_include/linux/trace_recursion.h:#arch_ftrace_ops_list_func
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/csd.h:#__flush_smp_call_function_queue
           :6           67%           4:6     dmesg.WARNING:at_include/trace/events/dma_fence.h:#dma_fence_signal_timestamp_locked
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/ipi.h:#__smp_call_single_queue
           :6           50%           3:6     dmesg.WARNING:at_include/trace/events/ipi.h:#resched_curr
           :6           17%           1:6     dmesg.WARNING:at_include/trace/events/ipi.h:#wake_up_nohz_cpu
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/irq.h:#__do_softirq
           :6           33%           2:6     dmesg.WARNING:at_include/trace/events/irq.h:#__handle_irq_event_percpu
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/irq.h:#__raise_softirq_irqoff
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/lock.h:#lock_acquire
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/lock.h:#lock_acquired
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/lock.h:#lock_contended
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/lock.h:#lock_release
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/preemptirq.h:#trace_hardirqs_off
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/preemptirq.h:#trace_hardirqs_off_finish
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/preemptirq.h:#trace_hardirqs_on
           :6           33%           2:6     dmesg.WARNING:at_include/trace/events/printk.h:#printk_sprint
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#__note_gp_changes
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#__wake_nocb_gp
           :6           33%           2:6     dmesg.WARNING:at_include/trace/events/rcu.h:#do_nocb_deferred_wakeup
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#do_nocb_deferred_wakeup_timer
           :6           17%           1:6     dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_barrier_trace
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_core
           :6           83%           5:6     dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_qs
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_report_qs_rnp
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_sched_clock_irq
           :6           50%           3:6     dmesg.WARNING:at_include/trace/events/sched.h:#__update_load_avg_blocked_se
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#__update_load_avg_cfs_rq
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#__update_load_avg_se
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#attach_entity_load_avg
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#dequeue_task_fair
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#detach_entity_load_avg
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#enqueue_task_fair
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#set_task_cpu
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#try_to_wake_up
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#ttwu_do_activate
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_cpu_capacity
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_curr
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_dl_rq_load_avg
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_irq_load_avg
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_rt_rq_load_avg
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_sd_lb_stats
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/timer.h:#__hrtimer_run_queues
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/timer.h:#__run_timers
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/timer.h:#call_timer_fn
           :6           33%           2:6     dmesg.WARNING:at_include/trace/events/timer.h:#detach_if_pending
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/timer.h:#enqueue_hrtimer
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/timer.h:#enqueue_timer
           :6           17%           1:6     dmesg.WARNING:at_include/trace/events/timer.h:#hrtimer_init
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/workqueue.h:#__queue_work
           :6          100%           6:6     dmesg.WARNING:at_kernel/rcu/tree.c:#rcu_is_cpu_rrupt_from_idle
           :6           17%           1:6     dmesg.WARNING:inconsistent_lock_state
           :6          100%           6:6     dmesg.WARNING:suspicious_RCU_usage
           :6           33%           2:6     dmesg.arch/x86/kernel/irq.c:#IRQ_failed_to_wake_up_RCU
           :6          100%           6:6     dmesg.include/linux/cgroup.h:#suspicious_rcu_dereference_check()usage
           :6          100%           6:6     dmesg.include/linux/rcupdate.h:#rcu_read_lock()used_illegally_while_idle
           :6          100%           6:6     dmesg.include/linux/rcupdate.h:#rcu_read_unlock()used_illegally_while_idle
           :6           17%           1:6     dmesg.inconsistent{IN-HARDIRQ-W}->{HARDIRQ-ON-W}usage
           :6          100%           6:6     dmesg.kernel/events/core.c:#suspicious_rcu_dereference_check()usage
           :6           83%           5:6     dmesg.kernel/exit.c:#suspicious_rcu_dereference_check()usage
           :6          100%           6:6     dmesg.kernel/rcu/tree.c:#RCU_dynticks_nmi_nesting_counter_underflow/zero
           :6           67%           4:6     dmesg.kernel/sched/core.c:#suspicious_rcu_dereference_check()usage
           :6          100%           6:6     dmesg.kernel/sched/fair.c:#suspicious_rcu_dereference_check()usage
           :6           83%           5:6     dmesg.kernel/workqueue.c:#RCU_or_wq_pool_mutex_should_be_held
           :6           83%           5:6     dmesg.kernel/workqueue.c:#suspicious_rcu_dereference_check()usage



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202309142334.a345320b-oliver.sang@intel.com


[    2.318233][    T1] BUG: sleeping function called from invalid context at kernel/sched/completion.c:111
[    2.318647][    T1] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
[    2.319725][    T1] preempt_count: 0, expected: 0
[    2.320293][    T1] no locks held by swapper/0/1.
[    2.320882][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.0-rc1-00034-gafa19b70494c #1
[    2.321957][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[    2.321957][    T1] Call Trace:
[    2.321957][    T1]  <TASK>
[ 2.321957][ T1] dump_stack_lvl (lib/dump_stack.c:107) 
[ 2.321957][ T1] __might_resched (kernel/sched/core.c:10188) 
[ 2.321957][ T1] ? sched_core_find (kernel/sched/core.c:10142) 
[ 2.321957][ T1] __wait_for_common (kernel/sched/completion.c:115) 
[ 2.321957][ T1] ? usleep_range_state (kernel/time/timer.c:2129) 
[ 2.321957][ T1] ? reacquire_held_locks (kernel/locking/lockdep.c:5404) 
[ 2.321957][ T1] ? wait_for_completion_io_timeout (kernel/sched/completion.c:110) 
[ 2.321957][ T1] ? do_raw_spin_lock (kernel/locking/spinlock_debug.c:121) 
[ 2.321957][ T1] ? rest_init (init/main.c:1431) 
[ 2.321957][ T1] kernel_init (init/main.c:1439) 
[ 2.321957][ T1] ? _raw_spin_unlock_irq (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:202) 
[ 2.321957][ T1] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 2.321957][ T1] ? rest_init (init/main.c:1431) 
[ 2.321957][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) 
[    2.321957][    T1]  </TASK>
[    2.322276][    T1] smpboot: CPU0: Intel Xeon E312xx (Sandy Bridge) (family: 0x6, model: 0x2a, stepping: 0x1)
[    2.327212][    T1] RCU Tasks: Setting shift to 1 and lim to 1 rcu_task_cb_adjust=1.
[    2.328461][    T1] RCU Tasks Rude: Setting shift to 1 and lim to 1 rcu_task_cb_adjust=1.
[    2.330045][    T1] RCU Tasks Trace: Setting shift to 1 and lim to 1 rcu_task_cb_adjust=1.
[    2.331385][    T1] Running RCU-tasks wait API self tests
[    2.333117][    T1] Performance Events: unsupported p6 CPU model 42 no PMU driver, software events only.
[    2.334654][    T1] signal: max sigframe size: 1776
[    2.335915][    T1] rcu: Hierarchical SRCU implementation.
[    2.336653][    T1] rcu: 	Max phase no-delay instances is 1000.
[    2.342550][   T14] Callback from call_rcu_tasks_trace() invoked.
[    2.344299][    T1] smp: Bringing up secondary CPUs ...
[    2.347251][    T1] smpboot: x86: Booting SMP configuration:
[    2.347999][    T1] .... node  #0, CPUs:      #1
[    0.487313][    T0] masked ExtINT on CPU#1
[    2.348624][    T0] ------------[ cut here ]------------
[ 2.348624][ T0] WARNING: CPU: 0 PID: 0 at include/trace/events/preemptirq.h:36 trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) 
[    2.348624][    T0] Modules linked in:
[    2.348624][    T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W          6.6.0-rc1-00034-gafa19b70494c #1
[    2.348624][    T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 2.348624][ T0] RIP: 0010:trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) 
[ 2.348624][ T0] Code: 03 80 3c 02 00 75 2a 48 8b 7b 08 31 d2 48 89 ee e8 99 f4 ff ff 65 ff 0d 0a 69 ce 7e e9 5d ff ff ff e8 38 18 e9 ff 84 c0 75 80 <0f> 0b e9 79 ff ff ff e8 c8 c0 3a 00 eb cf 66 0f 1f 44 00 00 65 8b
All code
========
   0:	03 80 3c 02 00 75    	add    0x7500023c(%rax),%eax
   6:	2a 48 8b             	sub    -0x75(%rax),%cl
   9:	7b 08                	jnp    0x13
   b:	31 d2                	xor    %edx,%edx
   d:	48 89 ee             	mov    %rbp,%rsi
  10:	e8 99 f4 ff ff       	call   0xfffffffffffff4ae
  15:	65 ff 0d 0a 69 ce 7e 	decl   %gs:0x7ece690a(%rip)        # 0x7ece6926
  1c:	e9 5d ff ff ff       	jmp    0xffffffffffffff7e
  21:	e8 38 18 e9 ff       	call   0xffffffffffe9185e
  26:	84 c0                	test   %al,%al
  28:	75 80                	jne    0xffffffffffffffaa
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	e9 79 ff ff ff       	jmp    0xffffffffffffffaa
  31:	e8 c8 c0 3a 00       	call   0x3ac0fe
  36:	eb cf                	jmp    0x7
  38:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  3e:	65                   	gs
  3f:	8b                   	.byte 0x8b

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	e9 79 ff ff ff       	jmp    0xffffffffffffff80
   7:	e8 c8 c0 3a 00       	call   0x3ac0d4
   c:	eb cf                	jmp    0xffffffffffffffdd
   e:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  14:	65                   	gs
  15:	8b                   	.byte 0x8b
[    2.348624][    T0] RSP: 0000:ffffffff84a07d60 EFLAGS: 00010046
[    2.348624][    T0] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81513ac2
[    2.348624][    T0] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffff85a79540
[    2.348624][    T0] RBP: ffffffff8352c418 R08: 0000000000000000 R09: fffffbfff0b4f2a8
[    2.348624][    T0] R10: ffffffff85a79547 R11: 0000000000000000 R12: 0000000000000000
[    2.348624][    T0] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    2.348624][    T0] FS:  0000000000000000(0000) GS:ffff8883aea00000(0000) knlGS:0000000000000000
[    2.348624][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.348624][    T0] CR2: ffff88843ffff000 CR3: 0000000004a89000 CR4: 00000000000406f0
[    2.348624][    T0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.348624][    T0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.348624][    T0] Call Trace:
[    2.348624][    T0]  <TASK>
[ 2.348624][ T0] ? __warn (kernel/panic.c:673) 
[ 2.348624][ T0] ? trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) 
[ 2.348624][ T0] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
[ 2.348624][ T0] ? handle_bug (arch/x86/kernel/traps.c:237) 
[ 2.348624][ T0] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) 
[ 2.348624][ T0] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) 
[ 2.348624][ T0] ? irqentry_enter (kernel/entry/common.c:315) 
[ 2.348624][ T0] ? trace_hardirqs_off_finish (arch/x86/include/asm/bitops.h:228 arch/x86/include/asm/bitops.h:240 include/asm-generic/bitops/instrumented-non-atomic.h:142 include/linux/cpumask.h:504 include/linux/cpumask.h:1082 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) 
[ 2.348624][ T0] ? trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) 
[ 2.348624][ T0] irqentry_enter (kernel/entry/common.c:315) 
[ 2.348624][ T0] sysvec_call_function_single (arch/x86/kernel/smp.c:287) 
[ 2.348624][ T0] asm_sysvec_call_function_single (arch/x86/include/asm/idtentry.h:652) 
[ 2.348624][ T0] RIP: 0010:default_idle (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:72 arch/x86/kernel/process.c:750) 
[ 2.348624][ T0] Code: 07 76 e7 48 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc eb 07 0f 00 2d 37 79 30 00 fb f4 <fa> c3 0f 1f 00 65 48 8b 04 25 40 a4 1f 00 f0 80 48 02 20 48 8b 10
All code
========
   0:	07                   	(bad)
   1:	76 e7                	jbe    0xffffffffffffffea
   3:	48 89 07             	mov    %rax,(%rdi)
   6:	49 c7 c0 08 00 00 00 	mov    $0x8,%r8
   d:	4d 29 c8             	sub    %r9,%r8
  10:	4c 01 c7             	add    %r8,%rdi
  13:	4c 29 c2             	sub    %r8,%rdx
  16:	e9 72 ff ff ff       	jmp    0xffffffffffffff8d
  1b:	cc                   	int3
  1c:	cc                   	int3
  1d:	cc                   	int3
  1e:	cc                   	int3
  1f:	eb 07                	jmp    0x28
  21:	0f 00 2d 37 79 30 00 	verw   0x307937(%rip)        # 0x30795f
  28:	fb                   	sti
  29:	f4                   	hlt
  2a:*	fa                   	cli		<-- trapping instruction
  2b:	c3                   	ret
  2c:	0f 1f 00             	nopl   (%rax)
  2f:	65 48 8b 04 25 40 a4 	mov    %gs:0x1fa440,%rax
  36:	1f 00 
  38:	f0 80 48 02 20       	lock orb $0x20,0x2(%rax)
  3d:	48 8b 10             	mov    (%rax),%rdx

Code starting with the faulting instruction
===========================================
   0:	fa                   	cli
   1:	c3                   	ret
   2:	0f 1f 00             	nopl   (%rax)
   5:	65 48 8b 04 25 40 a4 	mov    %gs:0x1fa440,%rax
   c:	1f 00 
   e:	f0 80 48 02 20       	lock orb $0x20,0x2(%rax)
  13:	48 8b 10             	mov    (%rax),%rdx
[    2.348624][    T0] RSP: 0000:ffffffff84a07e40 EFLAGS: 00000206
[    2.348624][    T0] RAX: 0000000000003056 RBX: 0000000000000000 RCX: ffffffff8352c7ba
[    2.348624][    T0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812e3bd3
[    2.348624][    T0] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed1075d7ed9d
[    2.348624][    T0] R10: ffff8883aebf6ceb R11: 0000000000000000 R12: 1ffffffff0940fca
[    2.348624][    T0] R13: ffffffff84a1e940 R14: dffffc0000000000 R15: 0000000000013f10
[ 2.348624][ T0] ? ct_kernel_exit+0xca/0x100 
[ 2.348624][ T0] ? cpuidle_idle_call (kernel/sched/idle.c:171) 
[ 2.348624][ T0] default_idle_call (include/linux/cpuidle.h:143 kernel/sched/idle.c:98) 
[ 2.348624][ T0] cpuidle_idle_call (kernel/sched/idle.c:171) 
[ 2.348624][ T0] ? arch_cpu_idle_exit+0x30/0x30 
[ 2.348624][ T0] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4565) 
[ 2.348624][ T0] do_idle (kernel/sched/idle.c:282) 
[ 2.348624][ T0] cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1)) 
[ 2.348624][ T0] rest_init (include/linux/rcupdate.h:750 init/main.c:697) 
[ 2.348624][ T0] arch_call_rest_init+0xf/0x20 
[ 2.348624][ T0] start_kernel (init/main.c:1021 (discriminator 1)) 
[ 2.348624][ T0] x86_64_start_reservations (arch/x86/kernel/head64.c:544) 
[ 2.348624][ T0] x86_64_start_kernel (??:?) 
[ 2.348624][ T0] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433) 
[    2.348624][    T0]  </TASK>
[    2.348624][    T0] irq event stamp: 12375
[ 2.348624][ T0] hardirqs last enabled at (12374): cpuidle_idle_call (kernel/sched/idle.c:171) 
[ 2.348624][ T0] hardirqs last disabled at (12375): sysvec_call_function_single (arch/x86/kernel/smp.c:287) 
[ 2.348624][ T0] softirqs last enabled at (8563): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582) 
[ 2.348624][ T0] softirqs last disabled at (8558): __irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632) 
[    2.348624][    T0] ---[ end trace 0000000000000000 ]---


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230914/202309142334.a345320b-oliver.sang@intel.com
diff mbox series

Patch

diff --git a/init/main.c b/init/main.c
index ad920fac325c..f74772acf612 100644
--- a/init/main.c
+++ b/init/main.c
@@ -696,7 +696,7 @@  noinline void __ref __noreturn rest_init(void)
 	 */
 	rcu_read_lock();
 	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
-	tsk->flags |= PF_NO_SETAFFINITY;
+	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
 	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
 	rcu_read_unlock();
 
@@ -938,6 +938,8 @@  void start_kernel(void)
 	 * time - but meanwhile we still have a functioning scheduler.
 	 */
 	sched_init();
+	/* Avoid early context switch, rest_init() restores PF_IDLE */
+	current->flags &= ~PF_IDLE;
 
 	if (WARN(!irqs_disabled(),
 		 "Interrupts were enabled *very* early, fixing it\n"))