Message ID | b46cca62b5ebd131d0a744572e57f5ebc53d5ef4.1737511963.git.jpoimboe@kernel.org (mailing list archive) |
---|---|
State | New |
Series | unwind, perf: sframe user space unwinding |
On Tue, Jan 21, 2025 at 06:30:53PM -0800, Josh Poimboeuf wrote:
> It's possible for irq_work_queue() to fail if the work has already been
> claimed. That can happen if a TWA_NMI_CURRENT task work is requested
> before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
> chance to run.

I'm confused, if it fails then it's already pending, and we'll get the
notification already. You can still add the work.

> The error has to be checked before the write to task->task_works. Also
> the try_cmpxchg() loop isn't needed in NMI context. The TWA_NMI_CURRENT
> case really is special, keep things simple by keeping its code all
> together in one place.

NMIs can nest, consider #DB (which is NMI like) doing task_work_add()
and getting interrupted with NMI doing the same.

Might all this be fallout from trying to fix that schedule() bug from
the next patch, because as is, I don't see it.
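The point that a failed queue attempt still leaves a notification pending can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every `model_*` name is hypothetical, the "IRQ work" is reduced to a single pending flag, and none of it is the kernel's actual irq_work or task_work code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration, not the kernel's types. */
struct model_irq_work {
	atomic_bool pending;	/* true while a notification is queued */
};

struct model_cb {
	struct model_cb *next;
};

/* Returns true if we claimed the work, false if it was already pending. */
static bool model_irq_work_queue(struct model_irq_work *w)
{
	return !atomic_exchange(&w->pending, true);
}

/*
 * Add the callback whether or not the notification was newly queued:
 * a false return only means a notification is already on its way.
 */
static bool model_add(struct model_cb **list, struct model_cb *cb,
		      struct model_irq_work *w)
{
	cb->next = *list;
	*list = cb;
	return model_irq_work_queue(w);
}

/* Model of the pending work running: clear the flag, drain the list. */
static int model_run(struct model_cb **list, struct model_irq_work *w)
{
	int n = 0;

	atomic_store(&w->pending, false);
	for (struct model_cb *cb = *list; cb; cb = cb->next)
		n++;
	*list = NULL;
	return n;
}

/* Walk through the case discussed above. */
static void model_demo(void)
{
	struct model_irq_work w;
	struct model_cb a, b, *list = NULL;

	atomic_init(&w.pending, false);
	assert(model_add(&list, &a, &w));	/* first add claims the work */
	assert(!model_add(&list, &b, &w));	/* second add: already pending */
	assert(model_run(&list, &w) == 2);	/* one run still sees both */
}
```

In this model the second add reports that the work was already claimed, yet the single pending run still observes both callbacks, which is the behaviour the reply describes.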
On Wed, Jan 22, 2025 at 01:28:21PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 06:30:53PM -0800, Josh Poimboeuf wrote:
> > It's possible for irq_work_queue() to fail if the work has already been
> > claimed. That can happen if a TWA_NMI_CURRENT task work is requested
> > before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
> > chance to run.
>
> I'm confused, if it fails then it's already pending, and we'll get the
> notification already. You can still add the work.

Yeah, I suppose that makes sense. If the pending irq_work is already
going to set TIF_NOTIFY_RESUME anyway, there's no need to do that again.

> > The error has to be checked before the write to task->task_works. Also
> > the try_cmpxchg() loop isn't needed in NMI context. The TWA_NMI_CURRENT
> > case really is special, keep things simple by keeping its code all
> > together in one place.
>
> NMIs can nest,

Just for my understanding: for nested NMIs, the entry code basically
queues up the next NMI, so the C handler (exc_nmi) can't nest. Right?

> consider #DB (which is NMI like)

What exactly do you mean by "NMI like"? Is it because a #DB might be
basically running in NMI context, if the NMI hit a breakpoint?

> doing task_work_add() and getting interrupted with NMI doing the same.

How exactly would that work? At least with my patch the #DB wouldn't be
able to use TWA_NMI_CURRENT unless in_nmi() were true due to NMI hitting
a breakpoint. In which case a nested NMI wouldn't actually nest, it
would get "queued" by the entry code.

But yeah, I do see how the reverse can be true: somebody sets a
breakpoint in task_work, right where it's fiddling with the list head.
NMI calls task_work_add(TWA_NMI_CURRENT), triggering the #DB, which also
calls task_work_add().
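The breakpoint scenario at the end of this message can be modelled in userspace to show why a plain store to the list head loses the nested entry while a compare-and-exchange loop does not. This is a hedged sketch: the `model_*` names are hypothetical, the nested exception is simulated by a callback invoked between reading and writing the head, and C11 atomics stand in for the kernel's try_cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a callback list entry. */
struct model_cb {
	struct model_cb *next;
};

/* Hypothetical hook standing in for an exception that fires mid-update. */
typedef void (*model_hook)(struct model_cb *_Atomic *list);

/* Plain read-modify-write: anything pushed by the hook gets overwritten. */
static void model_push_plain(struct model_cb *_Atomic *list,
			     struct model_cb *cb, model_hook hook)
{
	struct model_cb *head = atomic_load(list);

	if (hook)
		hook(list);	/* nested push lands in this window */
	cb->next = head;
	atomic_store(list, cb);
}

/* Compare-and-exchange loop: retries if the head moved underneath us. */
static void model_push_cmpxchg(struct model_cb *_Atomic *list,
			       struct model_cb *cb, model_hook hook)
{
	struct model_cb *head = atomic_load(list);

	if (hook)
		hook(list);	/* same window as above */
	do {
		cb->next = head;	/* head is refreshed on each failure */
	} while (!atomic_compare_exchange_weak(list, &head, cb));
}

static struct model_cb model_nested_cb;

/* The simulated nested context adds its own entry. */
static void model_nested_push(struct model_cb *_Atomic *list)
{
	model_push_cmpxchg(list, &model_nested_cb, NULL);
}

static int model_count(struct model_cb *_Atomic *list)
{
	int n = 0;

	for (struct model_cb *cb = atomic_load(list); cb; cb = cb->next)
		n++;
	return n;
}

/* Walk through both variants with the same nested push. */
static void model_demo(void)
{
	struct model_cb *_Atomic list = NULL;
	struct model_cb outer;

	model_push_plain(&list, &outer, model_nested_push);
	assert(model_count(&list) == 1);	/* nested entry was lost */

	atomic_store(&list, NULL);
	model_push_cmpxchg(&list, &outer, model_nested_push);
	assert(model_count(&list) == 2);	/* both entries survive */
}
```

The model is single-threaded on purpose: the interruption happens on the same CPU, so a callback in the middle of the update is enough to reproduce the lost entry.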
On Wed, Jan 22, 2025 at 12:47:20PM -0800, Josh Poimboeuf wrote:
> On Wed, Jan 22, 2025 at 01:28:21PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 21, 2025 at 06:30:53PM -0800, Josh Poimboeuf wrote:
> > > It's possible for irq_work_queue() to fail if the work has already been
> > > claimed. That can happen if a TWA_NMI_CURRENT task work is requested
> > > before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
> > > chance to run.
> >
> > I'm confused, if it fails then it's already pending, and we'll get the
> > notification already. You can still add the work.
>
> Yeah, I suppose that makes sense. If the pending irq_work is already
> going to set TIF_NOTIFY_RESUME anyway, there's no need to do that again.
>
> > > The error has to be checked before the write to task->task_works. Also
> > > the try_cmpxchg() loop isn't needed in NMI context. The TWA_NMI_CURRENT
> > > case really is special, keep things simple by keeping its code all
> > > together in one place.
> >
> > NMIs can nest,
>
> Just for my understanding: for nested NMIs, the entry code basically
> queues up the next NMI, so the C handler (exc_nmi) can't nest. Right?
>
> > consider #DB (which is NMI like)
>
> What exactly do you mean by "NMI like"? Is it because a #DB might be
> basically running in NMI context, if the NMI hit a breakpoint?

No, #DB, #BP and such are considered NMI (and will have in_nmi() true)
because they can trigger anywhere, including sections where IRQs are
disabled.

> > doing task_work_add() and getting interrupted with NMI doing the same.
>
> How exactly would that work? At least with my patch the #DB wouldn't be
> able to use TWA_NMI_CURRENT unless in_nmi() were true

It is, see exc_debug_kernel() doing irqentry_nmi_enter().
On Thu, Jan 23, 2025 at 09:14:03AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2025 at 12:47:20PM -0800, Josh Poimboeuf wrote:
> > What exactly do you mean by "NMI like"? Is it because a #DB might be
> > basically running in NMI context, if the NMI hit a breakpoint?
>
> No, #DB, #BP and such are considered NMI (and will have in_nmi() true)
> because they can trigger anywhere, including sections where IRQs are
> disabled.

So:

  - while exceptions are technically not NMI, they're "NMI" because they
    can occur in NMI or IRQ-disabled regions

  - such "NMI" exceptions can be preempted by NMIs and "NMIs"

  - NMIs can be preempted by "NMIs" but not NMIs (except in entry code!)

... did I get all that right? Not subtle at all!

I feel like in_nmi() needs a comment explaining all that nonobviousness.
On Thu, Jan 23, 2025 at 09:15:09AM -0800, Josh Poimboeuf wrote:
> On Thu, Jan 23, 2025 at 09:14:03AM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2025 at 12:47:20PM -0800, Josh Poimboeuf wrote:
> > > What exactly do you mean by "NMI like"? Is it because a #DB might be
> > > basically running in NMI context, if the NMI hit a breakpoint?
> >
> > No, #DB, #BP and such are considered NMI (and will have in_nmi() true)
> > because they can trigger anywhere, including sections where IRQs are
> > disabled.
>
> So:
>
>   - while exceptions are technically not NMI, they're "NMI" because they
>     can occur in NMI or IRQ-disabled regions
>
>   - such "NMI" exceptions can be preempted by NMIs and "NMIs"
>
>   - NMIs can be preempted by "NMIs" but not NMIs (except in entry code!)
>
> ... did I get all that right? Not subtle at all!

Yeah, sounds about right :-)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index c969f1f26be5..92024a8bfe12 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -58,25 +58,38 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 	int flags = notify & TWA_FLAGS;
 
 	notify &= ~TWA_FLAGS;
+
 	if (notify == TWA_NMI_CURRENT) {
-		if (WARN_ON_ONCE(task != current))
+		if (WARN_ON_ONCE(!in_nmi() || task != current))
 			return -EINVAL;
 		if (!IS_ENABLED(CONFIG_IRQ_WORK))
 			return -EINVAL;
-	} else {
-		/*
-		 * Record the work call stack in order to print it in KASAN
-		 * reports.
-		 *
-		 * Note that stack allocation can fail if TWAF_NO_ALLOC flag
-		 * is set and new page is needed to expand the stack buffer.
-		 */
-		if (flags & TWAF_NO_ALLOC)
-			kasan_record_aux_stack_noalloc(work);
-		else
-			kasan_record_aux_stack(work);
+#ifdef CONFIG_IRQ_WORK
+		head = task->task_works;
+		if (unlikely(head == &work_exited))
+			return -ESRCH;
+
+		if (!irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume)))
+			return -EBUSY;
+
+		work->next = head;
+		task->task_works = work;
+#endif
+		return 0;
 	}
 
+	/*
+	 * Record the work call stack in order to print it in KASAN
+	 * reports.
+	 *
+	 * Note that stack allocation can fail if TWAF_NO_ALLOC flag
+	 * is set and new page is needed to expand the stack buffer.
+	 */
+	if (flags & TWAF_NO_ALLOC)
+		kasan_record_aux_stack_noalloc(work);
+	else
+		kasan_record_aux_stack(work);
+
 	head = READ_ONCE(task->task_works);
 	do {
 		if (unlikely(head == &work_exited))
It's possible for irq_work_queue() to fail if the work has already been
claimed. That can happen if a TWA_NMI_CURRENT task work is requested
before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
chance to run.

The error has to be checked before the write to task->task_works. Also
the try_cmpxchg() loop isn't needed in NMI context. The TWA_NMI_CURRENT
case really is special, keep things simple by keeping its code all
together in one place.

Fixes: 466e4d801cd4 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode.")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 kernel/task_work.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)