diff mbox series

[v4,01/39] task_work: Fix TWA_NMI_CURRENT error handling

Message ID b46cca62b5ebd131d0a744572e57f5ebc53d5ef4.1737511963.git.jpoimboe@kernel.org (mailing list archive)
State New
Series unwind, perf: sframe user space unwinding

Commit Message

Josh Poimboeuf Jan. 22, 2025, 2:30 a.m. UTC
It's possible for irq_work_queue() to fail if the work has already been
claimed.  That can happen if a TWA_NMI_CURRENT task work is requested
before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
chance to run.

The error has to be checked before the write to task->task_works.  Also
the try_cmpxchg() loop isn't needed in NMI context.  The TWA_NMI_CURRENT
case really is special; keep things simple by keeping its code all
together in one place.

Fixes: 466e4d801cd4 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode.")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 kernel/task_work.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

Comments

Peter Zijlstra Jan. 22, 2025, 12:28 p.m. UTC | #1
On Tue, Jan 21, 2025 at 06:30:53PM -0800, Josh Poimboeuf wrote:
> It's possible for irq_work_queue() to fail if the work has already been
> claimed.  That can happen if a TWA_NMI_CURRENT task work is requested
> before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
> chance to run.

I'm confused, if it fails then it's already pending, and we'll get the
notification already. You can still add the work.

> The error has to be checked before the write to task->task_works.  Also
> the try_cmpxchg() loop isn't needed in NMI context.  The TWA_NMI_CURRENT
> case really is special, keep things simple by keeping its code all
> together in one place.

NMIs can nest, consider #DB (which is NMI like) doing task_work_add()
and getting interrupted with NMI doing the same.


Might all this be fallout from trying to fix that schedule() bug from
the next patch, because as is, I don't see it.
Josh Poimboeuf Jan. 22, 2025, 8:47 p.m. UTC | #2
On Wed, Jan 22, 2025 at 01:28:21PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 06:30:53PM -0800, Josh Poimboeuf wrote:
> > It's possible for irq_work_queue() to fail if the work has already been
> > claimed.  That can happen if a TWA_NMI_CURRENT task work is requested
> > before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
> > chance to run.
> 
> I'm confused, if it fails then it's already pending, and we'll get the
> notification already. You can still add the work.

Yeah, I suppose that makes sense.  If the pending irq_work is already
going to set TIF_NOTIFY_RESUME anyway, there's no need to do that again.

> > The error has to be checked before the write to task->task_works.  Also
> > the try_cmpxchg() loop isn't needed in NMI context.  The TWA_NMI_CURRENT
> > case really is special, keep things simple by keeping its code all
> > together in one place.
> 
> NMIs can nest,

Just for my understanding: for nested NMIs, the entry code basically
queues up the next NMI, so the C handler (exc_nmi) can't nest.  Right?

> consider #DB (which is NMI like)

What exactly do you mean by "NMI like"?  Is it because a #DB might be
basically running in NMI context, if the NMI hit a breakpoint?

> doing task_work_add() and getting interrupted with NMI doing the same.

How exactly would that work?  At least with my patch the #DB wouldn't be
able to use TWA_NMI_CURRENT unless in_nmi() were true due to NMI hitting
a breakpoint.  In which case a nested NMI wouldn't actually nest, it
would get "queued" by the entry code.

But yeah, I do see how the reverse can be true: somebody sets a
breakpoint in task_work, right where it's fiddling with the list head.
NMI calls task_work_add(TWA_NMI_CURRENT), triggering the #DB, which also
calls task_work_add().
Peter Zijlstra Jan. 23, 2025, 8:14 a.m. UTC | #3
On Wed, Jan 22, 2025 at 12:47:20PM -0800, Josh Poimboeuf wrote:
> On Wed, Jan 22, 2025 at 01:28:21PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 21, 2025 at 06:30:53PM -0800, Josh Poimboeuf wrote:
> > > It's possible for irq_work_queue() to fail if the work has already been
> > > claimed.  That can happen if a TWA_NMI_CURRENT task work is requested
> > > before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
> > > chance to run.
> > 
> > I'm confused, if it fails then it's already pending, and we'll get the
> > notification already. You can still add the work.
> 
> Yeah, I suppose that makes sense.  If the pending irq_work is already
> going to set TIF_NOTIFY_RESUME anyway, there's no need to do that again.
> 
> > > The error has to be checked before the write to task->task_works.  Also
> > > the try_cmpxchg() loop isn't needed in NMI context.  The TWA_NMI_CURRENT
> > > case really is special, keep things simple by keeping its code all
> > > together in one place.
> > 
> > NMIs can nest,
> 
> Just for my understanding: for nested NMIs, the entry code basically
> queues up the next NMI, so the C handler (exc_nmi) can't nest.  Right?
> 
> > consider #DB (which is NMI like)
> 
> What exactly do you mean by "NMI like"?  Is it because a #DB might be
> basically running in NMI context, if the NMI hit a breakpoint?

No, #DB, #BP and such are considered NMI (and will have in_nmi() true)
because they can trigger anywhere, including sections where IRQs are
disabled.

> > doing task_work_add() and getting interrupted with NMI doing the same.
> 
> How exactly would that work?  At least with my patch the #DB wouldn't be
> able to use TWA_NMI_CURRENT unless in_nmi() were true

It is, see exc_debug_kernel() doing irqentry_nmi_enter().
Josh Poimboeuf Jan. 23, 2025, 5:15 p.m. UTC | #4
On Thu, Jan 23, 2025 at 09:14:03AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2025 at 12:47:20PM -0800, Josh Poimboeuf wrote:
> > What exactly do you mean by "NMI like"?  Is it because a #DB might be
> > basically running in NMI context, if the NMI hit a breakpoint?
> 
> No, #DB, #BP and such are considered NMI (and will have in_nmi() true)
> because they can trigger anywhere, including sections where IRQs are
> disabled.

So:

  - while exceptions are technically not NMI, they're "NMI" because they
    can occur in NMI or IRQ-disabled regions

  - such "NMI" exceptions can be preempted by NMIs and "NMIs"

  - NMIs can be preempted by "NMIs" but not NMIs (except in entry code!)

... did I get all that right?  Not subtle at all!

I feel like in_nmi() needs a comment explaining all that nonobviousness.
Peter Zijlstra Jan. 23, 2025, 10:19 p.m. UTC | #5
On Thu, Jan 23, 2025 at 09:15:09AM -0800, Josh Poimboeuf wrote:
> On Thu, Jan 23, 2025 at 09:14:03AM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2025 at 12:47:20PM -0800, Josh Poimboeuf wrote:
> > > What exactly do you mean by "NMI like"?  Is it because a #DB might be
> > > basically running in NMI context, if the NMI hit a breakpoint?
> > 
> > No, #DB, #BP and such are considered NMI (and will have in_nmi() true)
> > because they can trigger anywhere, including sections where IRQs are
> > disabled.
> 
> So:
> 
>   - while exceptions are technically not NMI, they're "NMI" because they
>     can occur in NMI or IRQ-disabled regions
> 
>   - such "NMI" exceptions can be preempted by NMIs and "NMIs"
> 
>   - NMIs can be preempted by "NMIs" but not NMIs (except in entry code!)
> 
> ... did I get all that right?  Not subtle at all!

Yeah, sounds about right :-)

Patch

diff --git a/kernel/task_work.c b/kernel/task_work.c
index c969f1f26be5..92024a8bfe12 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -58,25 +58,38 @@  int task_work_add(struct task_struct *task, struct callback_head *work,
 	int flags = notify & TWA_FLAGS;
 
 	notify &= ~TWA_FLAGS;
+
 	if (notify == TWA_NMI_CURRENT) {
-		if (WARN_ON_ONCE(task != current))
+		if (WARN_ON_ONCE(!in_nmi() || task != current))
 			return -EINVAL;
 		if (!IS_ENABLED(CONFIG_IRQ_WORK))
 			return -EINVAL;
-	} else {
-		/*
-		 * Record the work call stack in order to print it in KASAN
-		 * reports.
-		 *
-		 * Note that stack allocation can fail if TWAF_NO_ALLOC flag
-		 * is set and new page is needed to expand the stack buffer.
-		 */
-		if (flags & TWAF_NO_ALLOC)
-			kasan_record_aux_stack_noalloc(work);
-		else
-			kasan_record_aux_stack(work);
+#ifdef CONFIG_IRQ_WORK
+		head = task->task_works;
+		if (unlikely(head == &work_exited))
+			return -ESRCH;
+
+		if (!irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume)))
+			return -EBUSY;
+
+		work->next = head;
+		task->task_works = work;
+#endif
+		return 0;
 	}
 
+	/*
+	 * Record the work call stack in order to print it in KASAN
+	 * reports.
+	 *
+	 * Note that stack allocation can fail if TWAF_NO_ALLOC flag
+	 * is set and new page is needed to expand the stack buffer.
+	 */
+	if (flags & TWAF_NO_ALLOC)
+		kasan_record_aux_stack_noalloc(work);
+	else
+		kasan_record_aux_stack(work);
+
 	head = READ_ONCE(task->task_works);
 	do {
 		if (unlikely(head == &work_exited))