Message ID | 20230913005647.1534747-1-Liam.Howlett@oracle.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | init/main: Clear boot task idle flag | expand |
On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote: > Initial booting is setting the task flag to idle (PF_IDLE) by the call > path sched_init() -> init_idle(). Having the task idle and calling > call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be > set. Subsequent calls to any cond_resched() will enable IRQs, > potentially earlier than the IRQ setup has completed. Recent changes > have caused just this scenario and IRQs have been enabled early. > > This causes a warning later in start_kernel() as interrupts are enabled > before they are fully set up. > > Fix this issue by clearing the PF_IDLE flag on return from sched_init() > and restore the flag in rest_init(). Although the boot task was marked > as idle since (at least) d80e4fda576d, I am not sure that it is wrong to > do so. The forced context-switch on idle task was introduced in the > tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee. > > Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/ > Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/ > Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task") > Cc: stable@vger.kernel.org > Cc: Geert Uytterhoeven <geert@linux-m68k.org> > Cc: "Paul E. McKenney" <paulmck@kernel.org> > Cc: Christophe Leroy <christophe.leroy@csgroup.eu> > Cc: Andreas Schwab <schwab@linux-m68k.org> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Peng Zhang <zhangpeng.00@bytedance.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Juri Lelli <juri.lelli@redhat.com> > Cc: Vincent Guittot <vincent.guittot@linaro.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> > Cc: Vlastimil Babka <vbabka@suse.cz> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> > --- > init/main.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/init/main.c b/init/main.c > index ad920fac325c..f74772acf612 100644 > --- a/init/main.c > +++ b/init/main.c > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void) > */ > rcu_read_lock(); > tsk = find_task_by_pid_ns(pid, &init_pid_ns); > - tsk->flags |= PF_NO_SETAFFINITY; > + tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE; > set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id())); > rcu_read_unlock(); > > @@ -938,6 +938,8 @@ void start_kernel(void) > * time - but meanwhile we still have a functioning scheduler. > */ > sched_init(); > + /* Avoid early context switch, rest_init() restores PF_IDLE */ > + current->flags &= ~PF_IDLE; > > if (WARN(!irqs_disabled(), > "Interrupts were enabled *very* early, fixing it\n")) Hurmph... so since this is about IRQs, would it not make sense to have the | PF_IDLE near 'early_boot_irqs_disabled = false' ? Or, alternatively, make the tinyrcu thing check that variable?
On Wed, Sep 13, 2023 at 01:01:39PM +0200, Peter Zijlstra wrote: > On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote: > > Initial booting is setting the task flag to idle (PF_IDLE) by the call > > path sched_init() -> init_idle(). Having the task idle and calling > > call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be > > set. Subsequent calls to any cond_resched() will enable IRQs, > > potentially earlier than the IRQ setup has completed. Recent changes > > have caused just this scenario and IRQs have been enabled early. > > > > This causes a warning later in start_kernel() as interrupts are enabled > > before they are fully set up. > > > > Fix this issue by clearing the PF_IDLE flag on return from sched_init() > > and restore the flag in rest_init(). Although the boot task was marked > > as idle since (at least) d80e4fda576d, I am not sure that it is wrong to > > do so. The forced context-switch on idle task was introduced in the > > tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee. > > > > Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/ > > Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/ > > Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task") > > Cc: stable@vger.kernel.org > > Cc: Geert Uytterhoeven <geert@linux-m68k.org> > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > > Cc: Christophe Leroy <christophe.leroy@csgroup.eu> > > Cc: Andreas Schwab <schwab@linux-m68k.org> > > Cc: Matthew Wilcox <willy@infradead.org> > > Cc: Peng Zhang <zhangpeng.00@bytedance.com> > > Cc: Peter Zijlstra <peterz@infradead.org> > > Cc: Ingo Molnar <mingo@redhat.com> > > Cc: Juri Lelli <juri.lelli@redhat.com> > > Cc: Vincent Guittot <vincent.guittot@linaro.org> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> > > Cc: Vlastimil Babka <vbabka@suse.cz> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> > > --- > > init/main.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/init/main.c b/init/main.c > > index ad920fac325c..f74772acf612 100644 > > --- a/init/main.c > > +++ b/init/main.c > > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void) > > */ > > rcu_read_lock(); > > tsk = find_task_by_pid_ns(pid, &init_pid_ns); > > - tsk->flags |= PF_NO_SETAFFINITY; > > + tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE; > > set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id())); > > rcu_read_unlock(); > > > > @@ -938,6 +938,8 @@ void start_kernel(void) > > * time - but meanwhile we still have a functioning scheduler. > > */ > > sched_init(); > > + /* Avoid early context switch, rest_init() restores PF_IDLE */ > > + current->flags &= ~PF_IDLE; > > > > if (WARN(!irqs_disabled(), > > "Interrupts were enabled *very* early, fixing it\n")) > > Hurmph... so since this is about IRQs, would it not make sense to have > the | PF_IDLE near 'early_boot_irqs_disabled = false' ? > > Or, alternatively, make the tinyrcu thing check that variable? We could do that, but do we really the decidedly non-idle early boot sequence designated as idle? What surprises me is that this is just now showing up. The ingredients for this one have been in place for almost 10 years. Thanx, Paul
Hi Liam, On Wed, Sep 13, 2023 at 2:57 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > Initial booting is setting the task flag to idle (PF_IDLE) by the call > path sched_init() -> init_idle(). Having the task idle and calling > call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be > set. Subsequent calls to any cond_resched() will enable IRQs, > potentially earlier than the IRQ setup has completed. Recent changes > have caused just this scenario and IRQs have been enabled early. > > This causes a warning later in start_kernel() as interrupts are enabled > before they are fully set up. > > Fix this issue by clearing the PF_IDLE flag on return from sched_init() > and restore the flag in rest_init(). Although the boot task was marked > as idle since (at least) d80e4fda576d, I am not sure that it is wrong to > do so. The forced context-switch on idle task was introduced in the > tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee. > > Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/ > Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/ > Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task") > Cc: stable@vger.kernel.org > Cc: Geert Uytterhoeven <geert@linux-m68k.org> > Cc: "Paul E. McKenney" <paulmck@kernel.org> > Cc: Christophe Leroy <christophe.leroy@csgroup.eu> > Cc: Andreas Schwab <schwab@linux-m68k.org> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Peng Zhang <zhangpeng.00@bytedance.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Juri Lelli <juri.lelli@redhat.com> > Cc: Vincent Guittot <vincent.guittot@linaro.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> > Cc: Vlastimil Babka <vbabka@suse.cz> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Thanks for your patch! This fixes the WARNING: CPU: 0 PID: 0 at init/main.c:992 start_kernel+0x2f0/0x480 I was seeing during boot on Renesas RZ/A1 and RZ/A2 since commit cfeb6ae8bcb96ccf ("maple_tree: disable mas_wr_append() when other readers are possible") in v6.5. However, this causes lots of new warnings on e.g. R-Car M2-W, which wasn't affected before, like: +BUG: sleeping function called from invalid context at kernel/sched/completion.c:111 +in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 +preempt_count: 0, expected: 0 +no locks held by swapper/0/1. +CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.6.0-rc1-koelsch-02359-g9a5bed6a8bd8 #1864 +Hardware name: Generic R-Car Gen2 (Flattened Device Tree) + unwind_backtrace from show_stack+0x10/0x14 + show_stack from dump_stack_lvl+0x68/0x90 + dump_stack_lvl from __might_resched+0x1b0/0x22c + __might_resched from __wait_for_common+0x38/0x170 + __wait_for_common from kernel_init+0x10/0x130 + kernel_init from ret_from_fork+0x14/0x30 +Exception stack(0xf0815fb0 to 0xf0815ff8) +5fa0: 00000000 00000000 00000000 00000000 +5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 +5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 and +BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 +in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: systemd +preempt_count: 0, expected: 0 +no locks held by systemd/1. +CPU: 1 PID: 1 Comm: systemd Tainted: G W N 6.6.0-rc1-koelsch-02359-g9a5bed6a8bd8 #1864 +Hardware name: Generic R-Car Gen2 (Flattened Device Tree) + unwind_backtrace from show_stack+0x10/0x14 + show_stack from dump_stack_lvl+0x68/0x90 + dump_stack_lvl from __might_resched+0x1b0/0x22c + __might_resched from __mutex_lock+0x34/0x3a8 + __mutex_lock from mutex_lock_nested+0x18/0x20 + mutex_lock_nested from do_epoll_wait+0x220/0x5ac + do_epoll_wait from sys_epoll_wait+0x44/0x6c + sys_epoll_wait from ret_fast_syscall+0x0/0x1c +Exception stack(0xf0815fa8 to 0xf0815ff0) +5fa0: beff75e0 00000004 00000004 beff75e0 0000003b ffffffff +5fc0: beff75e0 00000004 ffffffff 000000fc 00000000 ffffffff 0000003b 00000004 +5fe0: 000000fc beff75b8 b6ef8511 b6e75786 Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
* Paul E. McKenney <paulmck@kernel.org> [230913 07:28]: > On Wed, Sep 13, 2023 at 01:01:39PM +0200, Peter Zijlstra wrote: > > On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote: > > > Initial booting is setting the task flag to idle (PF_IDLE) by the call > > > path sched_init() -> init_idle(). Having the task idle and calling > > > call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be > > > set. Subsequent calls to any cond_resched() will enable IRQs, > > > potentially earlier than the IRQ setup has completed. Recent changes > > > have caused just this scenario and IRQs have been enabled early. > > > > > > This causes a warning later in start_kernel() as interrupts are enabled > > > before they are fully set up. > > > > > > Fix this issue by clearing the PF_IDLE flag on return from sched_init() > > > and restore the flag in rest_init(). Although the boot task was marked > > > as idle since (at least) d80e4fda576d, I am not sure that it is wrong to > > > do so. The forced context-switch on idle task was introduced in the > > > tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee. > > > > > > Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/ > > > Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/ > > > Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task") > > > Cc: stable@vger.kernel.org > > > Cc: Geert Uytterhoeven <geert@linux-m68k.org> > > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > > > Cc: Christophe Leroy <christophe.leroy@csgroup.eu> > > > Cc: Andreas Schwab <schwab@linux-m68k.org> > > > Cc: Matthew Wilcox <willy@infradead.org> > > > Cc: Peng Zhang <zhangpeng.00@bytedance.com> > > > Cc: Peter Zijlstra <peterz@infradead.org> > > > Cc: Ingo Molnar <mingo@redhat.com> > > > Cc: Juri Lelli <juri.lelli@redhat.com> > > > Cc: Vincent Guittot <vincent.guittot@linaro.org> > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> > > > --- > > > init/main.c | 4 +++- > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > diff --git a/init/main.c b/init/main.c > > > index ad920fac325c..f74772acf612 100644 > > > --- a/init/main.c > > > +++ b/init/main.c > > > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void) > > > */ > > > rcu_read_lock(); > > > tsk = find_task_by_pid_ns(pid, &init_pid_ns); > > > - tsk->flags |= PF_NO_SETAFFINITY; > > > + tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE; > > > set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id())); > > > rcu_read_unlock(); > > > > > > @@ -938,6 +938,8 @@ void start_kernel(void) > > > * time - but meanwhile we still have a functioning scheduler. > > > */ > > > sched_init(); > > > + /* Avoid early context switch, rest_init() restores PF_IDLE */ > > > + current->flags &= ~PF_IDLE; > > > > > > if (WARN(!irqs_disabled(), > > > "Interrupts were enabled *very* early, fixing it\n")) > > > > Hurmph... so since this is about IRQs, would it not make sense to have > > the | PF_IDLE near 'early_boot_irqs_disabled = false' ? I was thinking that this isn't an idle thread until the end of boot, so leave it set as not idle until the end. > > > > Or, alternatively, make the tinyrcu thing check that variable? > > We could do that, but do we really the decidedly non-idle early boot > sequence designated as idle? call_rcu() tiny is called more than this code, so unless there is another reason we want to check the IRQ status it seemed best to change the boot task flag. I mean, I think the is_idle_task() check is valid in every other context, right? > > What surprises me is that this is just now showing up. The ingredients > for this one have been in place for almost 10 years. > Am I just that lucky? Cheers, Liam
On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote: > diff --git a/init/main.c b/init/main.c > index ad920fac325c..f74772acf612 100644 > --- a/init/main.c > +++ b/init/main.c > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void) > */ > rcu_read_lock(); > tsk = find_task_by_pid_ns(pid, &init_pid_ns); > - tsk->flags |= PF_NO_SETAFFINITY; > + tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE; > set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id())); > rcu_read_unlock(); > Hmm, isn't that pid-1 you're setting PF_IDLE on? The task becoming idle is 'current' at this point, see the cpu_startup_entry() call below. Would not something like so be the right thing? diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2299a5cfbfb9..802551e0009b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu) * PF_KTHREAD should already be set at this point; regardless, make it * look like a proper per-CPU kthread. */ - idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY; + idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY; kthread_set_per_cpu(idle, cpu); #ifdef CONFIG_SMP diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 342f58a329f5..5007b25c5bc6 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -373,6 +373,7 @@ EXPORT_SYMBOL_GPL(play_idle_precise); void cpu_startup_entry(enum cpuhp_state state) { + current->flags |= PF_IDLE; arch_cpu_idle_prepare(); cpuhp_online_idle(state); while (1)
* Peter Zijlstra <peterz@infradead.org> [230913 09:53]: > On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote: > > > diff --git a/init/main.c b/init/main.c > > index ad920fac325c..f74772acf612 100644 > > --- a/init/main.c > > +++ b/init/main.c > > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void) > > */ > > rcu_read_lock(); > > tsk = find_task_by_pid_ns(pid, &init_pid_ns); > > - tsk->flags |= PF_NO_SETAFFINITY; > > + tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE; > > set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id())); > > rcu_read_unlock(); > > > > Hmm, isn't that pid-1 you're setting PF_IDLE on? Yes, thanks. I think that is what Geert is hitting with my patch. debug __might_resched() in kernel/sched/core.c is failing to return in that first (complex) if statement. His report says pid 1 so this is likely the issue. > > The task becoming idle is 'current' at this point, see the > cpu_startup_entry() call below. > > Would not something like so be the right thing? > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 2299a5cfbfb9..802551e0009b 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu) > * PF_KTHREAD should already be set at this point; regardless, make it > * look like a proper per-CPU kthread. > */ > - idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY; > + idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY; I am concerned this will alter more than just the current task, which would mean more modifications later. There is a comment about it being called 'more than once' and 'per cpu' so I am hesitant to change the function itself. Although I am unsure of the call path.. fork_idle() -> init_idle() I guess? > kthread_set_per_cpu(idle, cpu); > > #ifdef CONFIG_SMP > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c > index 342f58a329f5..5007b25c5bc6 100644 > --- a/kernel/sched/idle.c > +++ b/kernel/sched/idle.c > @@ -373,6 +373,7 @@ EXPORT_SYMBOL_GPL(play_idle_precise); > > void cpu_startup_entry(enum cpuhp_state state) > { > + current->flags |= PF_IDLE; > arch_cpu_idle_prepare(); > cpuhp_online_idle(state); > while (1)
On Wed, Sep 13, 2023 at 10:51:25AM -0400, Liam R. Howlett wrote: > * Peter Zijlstra <peterz@infradead.org> [230913 09:53]: > > On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote: > > > > > diff --git a/init/main.c b/init/main.c > > > index ad920fac325c..f74772acf612 100644 > > > --- a/init/main.c > > > +++ b/init/main.c > > > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void) > > > */ > > > rcu_read_lock(); > > > tsk = find_task_by_pid_ns(pid, &init_pid_ns); > > > - tsk->flags |= PF_NO_SETAFFINITY; > > > + tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE; > > > set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id())); > > > rcu_read_unlock(); > > > > > > > Hmm, isn't that pid-1 you're setting PF_IDLE on? > > Yes, thanks. I think that is what Geert is hitting with my patch. > > debug __might_resched() in kernel/sched/core.c is failing to return in > that first (complex) if statement. His report says pid 1 so this is > likely the issue. > > > > > The task becoming idle is 'current' at this point, see the > > cpu_startup_entry() call below. > > > > Would not something like so be the right thing? > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index 2299a5cfbfb9..802551e0009b 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu) > > * PF_KTHREAD should already be set at this point; regardless, make it > > * look like a proper per-CPU kthread. > > */ > > - idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY; > > + idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY; > > I am concerned this will alter more than just the current task, which > would mean more modifications later. There is a comment about it being > called 'more than once' and 'per cpu' so I am hesitant to change the > function itself. > > Although I am unsure of the call path.. fork_idle() -> init_idle() I > guess? There's only 2 ways to get into do_idle(), through cpu_startup_entry() and play_idle_precise(). The latter already frobs PF_IDLE since it is the forced idle path, this then leaves cpu_startup_entry() which is the regular idle path. All idle threads will end up calling into it, the boot CPU through the rest_init() and the SMP cpus through arch SMP bringup. IOW, this ensures all idle loops will have PF_IDLE set but not the pre-idle loop setup code these threads run.
* Peter Zijlstra <peterz@infradead.org> [230913 12:13]: > On Wed, Sep 13, 2023 at 10:51:25AM -0400, Liam R. Howlett wrote: > > * Peter Zijlstra <peterz@infradead.org> [230913 09:53]: > > > On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote: > > > > > > > diff --git a/init/main.c b/init/main.c > > > > index ad920fac325c..f74772acf612 100644 > > > > --- a/init/main.c > > > > +++ b/init/main.c > > > > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void) > > > > */ > > > > rcu_read_lock(); > > > > tsk = find_task_by_pid_ns(pid, &init_pid_ns); > > > > - tsk->flags |= PF_NO_SETAFFINITY; > > > > + tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE; > > > > set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id())); > > > > rcu_read_unlock(); > > > > > > > > > > Hmm, isn't that pid-1 you're setting PF_IDLE on? > > > > Yes, thanks. I think that is what Geert is hitting with my patch. > > > > debug __might_resched() in kernel/sched/core.c is failing to return in > > that first (complex) if statement. His report says pid 1 so this is > > likely the issue. > > > > > > > > The task becoming idle is 'current' at this point, see the > > > cpu_startup_entry() call below. > > > > > > Would not something like so be the right thing? > > > > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > index 2299a5cfbfb9..802551e0009b 100644 > > > --- a/kernel/sched/core.c > > > +++ b/kernel/sched/core.c > > > @@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu) > > > * PF_KTHREAD should already be set at this point; regardless, make it > > > * look like a proper per-CPU kthread. > > > */ > > > - idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY; > > > + idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY; > > > > I am concerned this will alter more than just the current task, which > > would mean more modifications later. There is a comment about it being > > called 'more than once' and 'per cpu' so I am hesitant to change the > > function itself. > > > > Although I am unsure of the call path.. fork_idle() -> init_idle() I > > guess? > > There's only 2 ways to get into do_idle(), through cpu_startup_entry() > and play_idle_precise(). The latter already frobs PF_IDLE since it is > the forced idle path, this then leaves cpu_startup_entry() which is the > regular idle path. > > All idle threads will end up calling into it, the boot CPU through the > rest_init() and the SMP cpus through arch SMP bringup. > > IOW, this ensures all idle loops will have PF_IDLE set but not the > pre-idle loop setup code these threads run. Thanks for the information. This does leave the init_idle() function in the odd state of not setting PF_IDLE, but I guess that's okay?
On Wed, Sep 13, 2023 at 01:32:38PM -0400, Liam R. Howlett wrote: > * Peter Zijlstra <peterz@infradead.org> [230913 12:13]: > > On Wed, Sep 13, 2023 at 10:51:25AM -0400, Liam R. Howlett wrote: > > > * Peter Zijlstra <peterz@infradead.org> [230913 09:53]: > > > > On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote: > > > > > > > > > diff --git a/init/main.c b/init/main.c > > > > > index ad920fac325c..f74772acf612 100644 > > > > > --- a/init/main.c > > > > > +++ b/init/main.c > > > > > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void) > > > > > */ > > > > > rcu_read_lock(); > > > > > tsk = find_task_by_pid_ns(pid, &init_pid_ns); > > > > > - tsk->flags |= PF_NO_SETAFFINITY; > > > > > + tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE; > > > > > set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id())); > > > > > rcu_read_unlock(); > > > > > > > > > > > > > Hmm, isn't that pid-1 you're setting PF_IDLE on? > > > > > > Yes, thanks. I think that is what Geert is hitting with my patch. > > > > > > debug __might_resched() in kernel/sched/core.c is failing to return in > > > that first (complex) if statement. His report says pid 1 so this is > > > likely the issue. > > > > > > > > > > > The task becoming idle is 'current' at this point, see the > > > > cpu_startup_entry() call below. > > > > > > > > Would not something like so be the right thing? > > > > > > > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > > index 2299a5cfbfb9..802551e0009b 100644 > > > > --- a/kernel/sched/core.c > > > > +++ b/kernel/sched/core.c > > > > @@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu) > > > > * PF_KTHREAD should already be set at this point; regardless, make it > > > > * look like a proper per-CPU kthread. > > > > */ > > > > - idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY; > > > > + idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY; > > > > > > I am concerned this will alter more than just the current task, which > > > would mean more modifications later. There is a comment about it being > > > called 'more than once' and 'per cpu' so I am hesitant to change the > > > function itself. > > > > > > Although I am unsure of the call path.. fork_idle() -> init_idle() I > > > guess? > > > > There's only 2 ways to get into do_idle(), through cpu_startup_entry() > > and play_idle_precise(). The latter already frobs PF_IDLE since it is > > the forced idle path, this then leaves cpu_startup_entry() which is the > > regular idle path. > > > > All idle threads will end up calling into it, the boot CPU through the > > rest_init() and the SMP cpus through arch SMP bringup. > > > > IOW, this ensures all idle loops will have PF_IDLE set but not the > > pre-idle loop setup code these threads run. > > Thanks for the information. This does leave the init_idle() function in > the odd state of not setting PF_IDLE, but I guess that's okay? Yep, the few things that care about PF_IDLE seem to really only care about do_idle() and very much not (per the rcutiny thing) any code that comes before it.
Hello, kernel test robot noticed "BUG:sleeping_function_called_from_invalid_context_at_kernel/sched/completion.c" on: commit: afa19b70494c22ba27224f90c0805e88e5db41e1 ("[PATCH] init/main: Clear boot task idle flag") url: https://github.com/intel-lab-lkp/linux/commits/Liam-R-Howlett/init-main-Clear-boot-task-idle-flag/20230913-085921 base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 3669558bdf354cd352be955ef2764cde6a9bf5ec patch link: https://lore.kernel.org/all/20230913005647.1534747-1-Liam.Howlett@oracle.com/ patch subject: [PATCH] init/main: Clear boot task idle flag in testcase: boot compiler: gcc-12 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) besides BUG:sleeping_function_called_from_invalid_context_at_kernel/sched/completion.c, we also observed other issues on this commit but clean on parent: 3669558bdf354cd3 afa19b70494c22ba27224f90c08 ---------------- --------------------------- fail:runs %reproduction fail:runs | | | :6 17% 1:6 dmesg.BUG:sleeping_function_called_from_invalid_context_at_drivers/dma-buf/selftest.c :6 17% 1:6 dmesg.BUG:sleeping_function_called_from_invalid_context_at_drivers/net/ethernet/intel/e1000_hw.c :6 100% 6:6 dmesg.BUG:sleeping_function_called_from_invalid_context_at_include/linux/sched/mm.h :6 83% 5:6 dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/mutex.c :6 50% 3:6 dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/rwsem.c :6 17% 1:6 dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/rcu/tree.c :6 100% 6:6 dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/sched/completion.c :6 100% 6:6 dmesg.RIP:__do_softirq :6 100% 6:6 dmesg.RIP:__flush_smp_call_function_queue :6 33% 2:6 dmesg.RIP:__handle_irq_event_percpu :6 100% 6:6 dmesg.RIP:__hrtimer_run_queues :6 100% 6:6 dmesg.RIP:__note_gp_changes :6 100% 6:6 dmesg.RIP:__queue_work :6 100% 6:6 dmesg.RIP:__raise_softirq_irqoff :6 100% 6:6 dmesg.RIP:__run_timers :6 100% 6:6 dmesg.RIP:__smp_call_single_queue :6 100% 6:6 dmesg.RIP:__sysvec_apic_timer_interrupt :6 100% 6:6 dmesg.RIP:__sysvec_call_function_single :6 50% 3:6 dmesg.RIP:__update_load_avg_blocked_se :6 100% 6:6 dmesg.RIP:__update_load_avg_cfs_rq :6 100% 6:6 dmesg.RIP:__update_load_avg_se :6 100% 6:6 dmesg.RIP:__wake_nocb_gp :6 67% 4:6 dmesg.RIP:_raw_spin_unlock_irqrestore :6 100% 6:6 dmesg.RIP:arch_ftrace_ops_list_func :6 100% 6:6 dmesg.RIP:attach_entity_load_avg :6 100% 6:6 dmesg.RIP:call_timer_fn :6 100% 6:6 dmesg.RIP:default_idle :6 100% 6:6 dmesg.RIP:dequeue_task_fair :6 100% 6:6 dmesg.RIP:detach_entity_load_avg :6 33% 2:6 dmesg.RIP:detach_if_pending :6 67% 4:6 dmesg.RIP:dma_fence_signal_timestamp_locked :6 33% 2:6 dmesg.RIP:do_nocb_deferred_wakeup :6 100% 6:6 dmesg.RIP:do_nocb_deferred_wakeup_timer :6 100% 6:6 dmesg.RIP:enqueue_hrtimer :6 100% 6:6 dmesg.RIP:enqueue_task_fair :6 100% 6:6 dmesg.RIP:enqueue_timer :6 17% 1:6 dmesg.RIP:hrtimer_init :6 67% 4:6 dmesg.RIP:load_balance :6 100% 6:6 dmesg.RIP:lock_acquire :6 100% 6:6 dmesg.RIP:lock_acquired :6 100% 6:6 dmesg.RIP:lock_contended :6 100% 6:6 dmesg.RIP:lock_release :6 33% 2:6 dmesg.RIP:printk_sprint :6 17% 1:6 dmesg.RIP:rcu_barrier_trace :6 100% 6:6 dmesg.RIP:rcu_core :6 100% 6:6 dmesg.RIP:rcu_is_cpu_rrupt_from_idle :6 83% 5:6 dmesg.RIP:rcu_qs :6 100% 6:6 dmesg.RIP:rcu_report_qs_rnp :6 100% 6:6 dmesg.RIP:rcu_sched_clock_irq :6 50% 3:6 dmesg.RIP:resched_curr :6 100% 6:6 dmesg.RIP:set_task_cpu :6 100% 6:6 dmesg.RIP:trace_hardirqs_off :6 100% 6:6 dmesg.RIP:trace_hardirqs_off_finish :6 100% 6:6 dmesg.RIP:trace_hardirqs_on :6 100% 6:6 dmesg.RIP:trace_reschedule_entry :6 100% 6:6 dmesg.RIP:trace_reschedule_exit :6 100% 6:6 dmesg.RIP:try_to_wake_up :6 100% 6:6 dmesg.RIP:ttwu_do_activate :6 100% 6:6 dmesg.RIP:update_cpu_capacity :6 100% 6:6 dmesg.RIP:update_curr :6 100% 6:6 dmesg.RIP:update_dl_rq_load_avg :6 100% 6:6 dmesg.RIP:update_irq_load_avg :6 100% 6:6 dmesg.RIP:update_rt_rq_load_avg :6 100% 6:6 dmesg.RIP:update_sd_lb_stats :6 17% 1:6 dmesg.RIP:wake_up_nohz_cpu :6 100% 6:6 dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#__sysvec_apic_timer_interrupt :6 100% 6:6 dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#__sysvec_call_function_single :6 100% 6:6 dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#trace_reschedule_entry :6 100% 6:6 dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#trace_reschedule_exit :6 100% 6:6 dmesg.WARNING:at_include/linux/trace_recursion.h:#arch_ftrace_ops_list_func :6 100% 6:6 dmesg.WARNING:at_include/trace/events/csd.h:#__flush_smp_call_function_queue :6 67% 4:6 dmesg.WARNING:at_include/trace/events/dma_fence.h:#dma_fence_signal_timestamp_locked :6 100% 6:6 dmesg.WARNING:at_include/trace/events/ipi.h:#__smp_call_single_queue :6 50% 3:6 dmesg.WARNING:at_include/trace/events/ipi.h:#resched_curr :6 17% 1:6 dmesg.WARNING:at_include/trace/events/ipi.h:#wake_up_nohz_cpu :6 100% 6:6 dmesg.WARNING:at_include/trace/events/irq.h:#__do_softirq :6 33% 2:6 dmesg.WARNING:at_include/trace/events/irq.h:#__handle_irq_event_percpu :6 100% 6:6 dmesg.WARNING:at_include/trace/events/irq.h:#__raise_softirq_irqoff :6 100% 6:6 dmesg.WARNING:at_include/trace/events/lock.h:#lock_acquire :6 100% 6:6 dmesg.WARNING:at_include/trace/events/lock.h:#lock_acquired :6 100% 6:6 dmesg.WARNING:at_include/trace/events/lock.h:#lock_contended :6 100% 6:6 dmesg.WARNING:at_include/trace/events/lock.h:#lock_release :6 100% 6:6 dmesg.WARNING:at_include/trace/events/preemptirq.h:#trace_hardirqs_off :6 100% 6:6 dmesg.WARNING:at_include/trace/events/preemptirq.h:#trace_hardirqs_off_finish :6 100% 6:6 dmesg.WARNING:at_include/trace/events/preemptirq.h:#trace_hardirqs_on :6 33% 2:6 dmesg.WARNING:at_include/trace/events/printk.h:#printk_sprint :6 100% 6:6 dmesg.WARNING:at_include/trace/events/rcu.h:#__note_gp_changes :6 100% 6:6 dmesg.WARNING:at_include/trace/events/rcu.h:#__wake_nocb_gp :6 33% 2:6 dmesg.WARNING:at_include/trace/events/rcu.h:#do_nocb_deferred_wakeup :6 100% 6:6 dmesg.WARNING:at_include/trace/events/rcu.h:#do_nocb_deferred_wakeup_timer :6 17% 1:6 dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_barrier_trace :6 100% 6:6 dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_core :6 83% 5:6 dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_qs :6 100% 6:6 dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_report_qs_rnp :6 100% 6:6 dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_sched_clock_irq :6 50% 3:6 dmesg.WARNING:at_include/trace/events/sched.h:#__update_load_avg_blocked_se :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#__update_load_avg_cfs_rq :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#__update_load_avg_se :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#attach_entity_load_avg :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#dequeue_task_fair :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#detach_entity_load_avg :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#enqueue_task_fair :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#set_task_cpu :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#try_to_wake_up :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#ttwu_do_activate :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#update_cpu_capacity :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#update_curr :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#update_dl_rq_load_avg :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#update_irq_load_avg :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#update_rt_rq_load_avg :6 100% 6:6 dmesg.WARNING:at_include/trace/events/sched.h:#update_sd_lb_stats :6 100% 6:6 dmesg.WARNING:at_include/trace/events/timer.h:#__hrtimer_run_queues :6 100% 6:6 dmesg.WARNING:at_include/trace/events/timer.h:#__run_timers :6 100% 6:6 dmesg.WARNING:at_include/trace/events/timer.h:#call_timer_fn :6 33% 2:6 dmesg.WARNING:at_include/trace/events/timer.h:#detach_if_pending :6 100% 6:6 dmesg.WARNING:at_include/trace/events/timer.h:#enqueue_hrtimer :6 100% 6:6 dmesg.WARNING:at_include/trace/events/timer.h:#enqueue_timer :6 17% 1:6 dmesg.WARNING:at_include/trace/events/timer.h:#hrtimer_init :6 100% 6:6 dmesg.WARNING:at_include/trace/events/workqueue.h:#__queue_work :6 100% 6:6 dmesg.WARNING:at_kernel/rcu/tree.c:#rcu_is_cpu_rrupt_from_idle :6 17% 1:6 dmesg.WARNING:inconsistent_lock_state :6 100% 6:6 dmesg.WARNING:suspicious_RCU_usage :6 33% 2:6 dmesg.arch/x86/kernel/irq.c:#IRQ_failed_to_wake_up_RCU :6 100% 6:6 dmesg.include/linux/cgroup.h:#suspicious_rcu_dereference_check()usage :6 100% 6:6 dmesg.include/linux/rcupdate.h:#rcu_read_lock()used_illegally_while_idle :6 100% 6:6 dmesg.include/linux/rcupdate.h:#rcu_read_unlock()used_illegally_while_idle :6 17% 1:6 dmesg.inconsistent{IN-HARDIRQ-W}->{HARDIRQ-ON-W}usage :6 100% 6:6 dmesg.kernel/events/core.c:#suspicious_rcu_dereference_check()usage :6 83% 5:6 dmesg.kernel/exit.c:#suspicious_rcu_dereference_check()usage :6 100% 6:6 dmesg.kernel/rcu/tree.c:#RCU_dynticks_nmi_nesting_counter_underflow/zero :6 67% 4:6 dmesg.kernel/sched/core.c:#suspicious_rcu_dereference_check()usage :6 100% 6:6 dmesg.kernel/sched/fair.c:#suspicious_rcu_dereference_check()usage :6 83% 5:6 dmesg.kernel/workqueue.c:#RCU_or_wq_pool_mutex_should_be_held :6 83% 5:6 dmesg.kernel/workqueue.c:#suspicious_rcu_dereference_check()usage If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202309142334.a345320b-oliver.sang@intel.com [ 2.318233][ T1] BUG: sleeping function called from invalid context at kernel/sched/completion.c:111 [ 2.318647][ T1] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 [ 2.319725][ T1] preempt_count: 0, expected: 0 [ 2.320293][ T1] no locks held by swapper/0/1. [ 2.320882][ T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.0-rc1-00034-gafa19b70494c #1 [ 2.321957][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 2.321957][ T1] Call Trace: [ 2.321957][ T1] <TASK> [ 2.321957][ T1] dump_stack_lvl (lib/dump_stack.c:107) [ 2.321957][ T1] __might_resched (kernel/sched/core.c:10188) [ 2.321957][ T1] ? sched_core_find (kernel/sched/core.c:10142) [ 2.321957][ T1] __wait_for_common (kernel/sched/completion.c:115) [ 2.321957][ T1] ? usleep_range_state (kernel/time/timer.c:2129) [ 2.321957][ T1] ? reacquire_held_locks (kernel/locking/lockdep.c:5404) [ 2.321957][ T1] ? wait_for_completion_io_timeout (kernel/sched/completion.c:110) [ 2.321957][ T1] ? do_raw_spin_lock (kernel/locking/spinlock_debug.c:121) [ 2.321957][ T1] ? rest_init (init/main.c:1431) [ 2.321957][ T1] kernel_init (init/main.c:1439) [ 2.321957][ T1] ? _raw_spin_unlock_irq (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:202) [ 2.321957][ T1] ret_from_fork (arch/x86/kernel/process.c:153) [ 2.321957][ T1] ? rest_init (init/main.c:1431) [ 2.321957][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) [ 2.321957][ T1] </TASK> [ 2.322276][ T1] smpboot: CPU0: Intel Xeon E312xx (Sandy Bridge) (family: 0x6, model: 0x2a, stepping: 0x1) [ 2.327212][ T1] RCU Tasks: Setting shift to 1 and lim to 1 rcu_task_cb_adjust=1. [ 2.328461][ T1] RCU Tasks Rude: Setting shift to 1 and lim to 1 rcu_task_cb_adjust=1. [ 2.330045][ T1] RCU Tasks Trace: Setting shift to 1 and lim to 1 rcu_task_cb_adjust=1. [ 2.331385][ T1] Running RCU-tasks wait API self tests [ 2.333117][ T1] Performance Events: unsupported p6 CPU model 42 no PMU driver, software events only. [ 2.334654][ T1] signal: max sigframe size: 1776 [ 2.335915][ T1] rcu: Hierarchical SRCU implementation. [ 2.336653][ T1] rcu: Max phase no-delay instances is 1000. [ 2.342550][ T14] Callback from call_rcu_tasks_trace() invoked. [ 2.344299][ T1] smp: Bringing up secondary CPUs ... [ 2.347251][ T1] smpboot: x86: Booting SMP configuration: [ 2.347999][ T1] .... node #0, CPUs: #1 [ 0.487313][ T0] masked ExtINT on CPU#1 [ 2.348624][ T0] ------------[ cut here ]------------ [ 2.348624][ T0] WARNING: CPU: 0 PID: 0 at include/trace/events/preemptirq.h:36 trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) [ 2.348624][ T0] Modules linked in: [ 2.348624][ T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 6.6.0-rc1-00034-gafa19b70494c #1 [ 2.348624][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 2.348624][ T0] RIP: 0010:trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) [ 2.348624][ T0] Code: 03 80 3c 02 00 75 2a 48 8b 7b 08 31 d2 48 89 ee e8 99 f4 ff ff 65 ff 0d 0a 69 ce 7e e9 5d ff ff ff e8 38 18 e9 ff 84 c0 75 80 <0f> 0b e9 79 ff ff ff e8 c8 c0 3a 00 eb cf 66 0f 1f 44 00 00 65 8b All code ======== 0: 03 80 3c 02 00 75 add 0x7500023c(%rax),%eax 6: 2a 48 8b sub -0x75(%rax),%cl 9: 7b 08 jnp 0x13 b: 31 d2 xor %edx,%edx d: 48 89 ee mov %rbp,%rsi 10: e8 99 f4 ff ff call 0xfffffffffffff4ae 15: 65 ff 0d 0a 69 ce 7e decl %gs:0x7ece690a(%rip) # 0x7ece6926 1c: e9 5d ff ff ff jmp 0xffffffffffffff7e 21: e8 38 18 e9 ff call 0xffffffffffe9185e 26: 84 c0 test %al,%al 28: 75 80 jne 0xffffffffffffffaa 2a:* 0f 0b ud2 <-- trapping instruction 2c: e9 79 ff ff ff jmp 0xffffffffffffffaa 31: e8 c8 c0 3a 00 call 0x3ac0fe 36: eb cf jmp 0x7 38: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) 3e: 65 gs 3f: 8b .byte 0x8b Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: e9 79 ff ff ff jmp 0xffffffffffffff80 7: e8 c8 c0 3a 00 call 0x3ac0d4 c: eb cf jmp 0xffffffffffffffdd e: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) 14: 65 gs 15: 8b .byte 0x8b [ 2.348624][ T0] RSP: 0000:ffffffff84a07d60 EFLAGS: 00010046 [ 2.348624][ T0] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81513ac2 [ 2.348624][ T0] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffff85a79540 [ 2.348624][ T0] RBP: ffffffff8352c418 R08: 0000000000000000 R09: fffffbfff0b4f2a8 [ 2.348624][ T0] R10: ffffffff85a79547 R11: 0000000000000000 R12: 0000000000000000 [ 2.348624][ T0] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 2.348624][ T0] FS: 0000000000000000(0000) GS:ffff8883aea00000(0000) knlGS:0000000000000000 [ 2.348624][ T0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2.348624][ T0] CR2: ffff88843ffff000 CR3: 0000000004a89000 CR4: 00000000000406f0 [ 2.348624][ T0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2.348624][ T0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2.348624][ T0] Call Trace: [ 2.348624][ T0] <TASK> [ 2.348624][ T0] ? __warn (kernel/panic.c:673) [ 2.348624][ T0] ? trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) [ 2.348624][ T0] ? report_bug (lib/bug.c:180 lib/bug.c:219) [ 2.348624][ T0] ? handle_bug (arch/x86/kernel/traps.c:237) [ 2.348624][ T0] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) [ 2.348624][ T0] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) [ 2.348624][ T0] ? irqentry_enter (kernel/entry/common.c:315) [ 2.348624][ T0] ? trace_hardirqs_off_finish (arch/x86/include/asm/bitops.h:228 arch/x86/include/asm/bitops.h:240 include/asm-generic/bitops/instrumented-non-atomic.h:142 include/linux/cpumask.h:504 include/linux/cpumask.h:1082 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) [ 2.348624][ T0] ? trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) [ 2.348624][ T0] irqentry_enter (kernel/entry/common.c:315) [ 2.348624][ T0] sysvec_call_function_single (arch/x86/kernel/smp.c:287) [ 2.348624][ T0] asm_sysvec_call_function_single (arch/x86/include/asm/idtentry.h:652) [ 2.348624][ T0] RIP: 0010:default_idle (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:72 arch/x86/kernel/process.c:750) [ 2.348624][ T0] Code: 07 76 e7 48 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc eb 07 0f 00 2d 37 79 30 00 fb f4 <fa> c3 0f 1f 00 65 48 8b 04 25 40 a4 1f 00 f0 80 48 02 20 48 8b 10 All code ======== 0: 07 (bad) 1: 76 e7 jbe 0xffffffffffffffea 3: 48 89 07 mov %rax,(%rdi) 6: 49 c7 c0 08 00 00 00 mov $0x8,%r8 d: 4d 29 c8 sub %r9,%r8 10: 4c 01 c7 add %r8,%rdi 13: 4c 29 c2 sub %r8,%rdx 16: e9 72 ff ff ff jmp 0xffffffffffffff8d 1b: cc int3 1c: cc int3 1d: cc int3 1e: cc int3 1f: eb 07 jmp 0x28 21: 0f 00 2d 37 79 30 00 verw 0x307937(%rip) # 0x30795f 28: fb sti 29: f4 hlt 2a:* fa cli <-- trapping instruction 2b: c3 ret 2c: 0f 1f 00 nopl (%rax) 2f: 65 48 8b 04 25 40 a4 mov %gs:0x1fa440,%rax 36: 1f 00 38: f0 80 48 02 20 lock orb $0x20,0x2(%rax) 3d: 48 8b 10 mov (%rax),%rdx Code starting with the faulting instruction =========================================== 0: fa cli 1: c3 ret 2: 0f 1f 00 nopl (%rax) 5: 65 48 8b 04 25 40 a4 mov %gs:0x1fa440,%rax c: 1f 00 e: f0 80 48 02 20 lock orb $0x20,0x2(%rax) 13: 48 8b 10 mov (%rax),%rdx [ 2.348624][ T0] RSP: 0000:ffffffff84a07e40 EFLAGS: 00000206 [ 2.348624][ T0] RAX: 0000000000003056 RBX: 0000000000000000 RCX: ffffffff8352c7ba [ 2.348624][ T0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812e3bd3 [ 2.348624][ T0] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed1075d7ed9d [ 2.348624][ T0] R10: ffff8883aebf6ceb R11: 0000000000000000 R12: 1ffffffff0940fca [ 2.348624][ T0] R13: ffffffff84a1e940 R14: dffffc0000000000 R15: 0000000000013f10 [ 2.348624][ T0] ? ct_kernel_exit+0xca/0x100 [ 2.348624][ T0] ? cpuidle_idle_call (kernel/sched/idle.c:171) [ 2.348624][ T0] default_idle_call (include/linux/cpuidle.h:143 kernel/sched/idle.c:98) [ 2.348624][ T0] cpuidle_idle_call (kernel/sched/idle.c:171) [ 2.348624][ T0] ? arch_cpu_idle_exit+0x30/0x30 [ 2.348624][ T0] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4565) [ 2.348624][ T0] do_idle (kernel/sched/idle.c:282) [ 2.348624][ T0] cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1)) [ 2.348624][ T0] rest_init (include/linux/rcupdate.h:750 init/main.c:697) [ 2.348624][ T0] arch_call_rest_init+0xf/0x20 [ 2.348624][ T0] start_kernel (init/main.c:1021 (discriminator 1)) [ 2.348624][ T0] x86_64_start_reservations (arch/x86/kernel/head64.c:544) [ 2.348624][ T0] x86_64_start_kernel (??:?) [ 2.348624][ T0] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433) [ 2.348624][ T0] </TASK> [ 2.348624][ T0] irq event stamp: 12375 [ 2.348624][ T0] hardirqs last enabled at (12374): cpuidle_idle_call (kernel/sched/idle.c:171) [ 2.348624][ T0] hardirqs last disabled at (12375): sysvec_call_function_single (arch/x86/kernel/smp.c:287) [ 2.348624][ T0] softirqs last enabled at (8563): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582) [ 2.348624][ T0] softirqs last disabled at (8558): __irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632) [ 2.348624][ T0] ---[ end trace 0000000000000000 ]--- The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20230914/202309142334.a345320b-oliver.sang@intel.com
diff --git a/init/main.c b/init/main.c index ad920fac325c..f74772acf612 100644 --- a/init/main.c +++ b/init/main.c @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void) */ rcu_read_lock(); tsk = find_task_by_pid_ns(pid, &init_pid_ns); - tsk->flags |= PF_NO_SETAFFINITY; + tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE; set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id())); rcu_read_unlock(); @@ -938,6 +938,8 @@ void start_kernel(void) * time - but meanwhile we still have a functioning scheduler. */ sched_init(); + /* Avoid early context switch, rest_init() restores PF_IDLE */ + current->flags &= ~PF_IDLE; if (WARN(!irqs_disabled(), "Interrupts were enabled *very* early, fixing it\n"))
Initial booting is setting the task flag to idle (PF_IDLE) by the call path sched_init() -> init_idle(). Having the task idle and calling call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be set. Subsequent calls to any cond_resched() will enable IRQs, potentially earlier than the IRQ setup has completed. Recent changes have caused just this scenario and IRQs have been enabled early. This causes a warning later in start_kernel() as interrupts are enabled before they are fully set up. Fix this issue by clearing the PF_IDLE flag on return from sched_init() and restore the flag in rest_init(). Although the boot task was marked as idle since (at least) d80e4fda576d, I am not sure that it is wrong to do so. The forced context-switch on idle task was introduced in the tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee. Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/ Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/ Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task") Cc: stable@vger.kernel.org Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> --- init/main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)