[1/5] fs: gate final fput task_work on PF_NO_TASKWORK

Message ID: 20250409134057.198671-2-axboe@kernel.dk
State: New
Series: Cancel and wait for all requests on exit

Commit Message

Jens Axboe April 9, 2025, 1:35 p.m. UTC
fput currently gates whether a task can run task_work on the PF_KTHREAD
flag, which excludes kernel threads: they don't usually run task_work,
since they never exit to userspace. As a result, the final fput done from
a kthread is punted to a delayed work item instead of using task_work.

It's perfectly viable to have the final fput done by the kthread itself,
as long as it will actually run the task_work. Add a PF_NO_TASKWORK flag
which is set by default for kernel threads, and gate the task_work fput
on that instead. This enables a kernel thread to clear the flag
temporarily while putting files, as long as it runs its task_work
manually.

This enables users like io_uring to ensure that, when the final fput of
a file is done as part of ring teardown, the local task_work is run, and
hence to know that all files have been properly put, without needing to
resort to workqueue flushing tricks which can deadlock.

No functional changes in this patch.

Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/file_table.c       | 2 +-
 include/linux/sched.h | 2 +-
 kernel/fork.c         | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
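
As an illustration (not part of the patch, and the helper name below is
made up), the usage pattern the commit message describes for a kernel
thread that wants its final fputs handled via task_work might look
roughly like this:

	static void example_kthread_put_files(struct file *file)
	{
		/* opt in to task_work while dropping our file reference */
		current->flags &= ~PF_NO_TASKWORK;

		/* a final fput() now queues ____fput() as task_work */
		fput(file);

		/*
		 * Run the queued work ourselves; a kthread never returns to
		 * userspace, so nobody else will run it for us.
		 */
		task_work_run();

		/* restore the default kthread behaviour */
		current->flags |= PF_NO_TASKWORK;
	}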

Comments

Christian Brauner April 11, 2025, 1:48 p.m. UTC | #1
On Wed, Apr 09, 2025 at 07:35:19AM -0600, Jens Axboe wrote:
> fput currently gates whether a task can run task_work on the PF_KTHREAD
> flag, which excludes kernel threads: they don't usually run task_work,
> since they never exit to userspace. As a result, the final fput done from
> a kthread is punted to a delayed work item instead of using task_work.
> 
> It's perfectly viable to have the final fput done by the kthread itself,
> as long as it will actually run the task_work. Add a PF_NO_TASKWORK flag
> which is set by default for kernel threads, and gate the task_work fput
> on that instead. This enables a kernel thread to clear the flag
> temporarily while putting files, as long as it runs its task_work
> manually.
> 
> This enables users like io_uring to ensure that, when the final fput of
> a file is done as part of ring teardown, the local task_work is run, and
> hence to know that all files have been properly put, without needing to
> resort to workqueue flushing tricks which can deadlock.
> 
> No functional changes in this patch.
> 
> Cc: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---

Seems fine. Although it has some potential for abuse. So maybe a
VFS_WARN_ON_ONCE() that PF_NO_TASKWORK is only used with PF_KTHREAD
would make sense.

Acked-by: Christian Brauner <brauner@kernel.org>

>  fs/file_table.c       | 2 +-
>  include/linux/sched.h | 2 +-
>  kernel/fork.c         | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/file_table.c b/fs/file_table.c
> index c04ed94cdc4b..e3c3dd1b820d 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -521,7 +521,7 @@ static void __fput_deferred(struct file *file)
>  		return;
>  	}
>  
> -	if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> +	if (likely(!in_interrupt() && !(task->flags & PF_NO_TASKWORK))) {
>  		init_task_work(&file->f_task_work, ____fput);
>  		if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
>  			return;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..349c993fc32b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1736,7 +1736,7 @@ extern struct pid *cad_pid;
>  						 * I am cleaning dirty pages from some other bdi. */
>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
> -#define PF__HOLE__00800000	0x00800000
> +#define PF_NO_TASKWORK		0x00800000	/* task doesn't run task_work */
>  #define PF__HOLE__01000000	0x01000000
>  #define PF__HOLE__02000000	0x02000000
>  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c4b26cd8998b..8dd0b8a5348d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2261,7 +2261,7 @@ __latent_entropy struct task_struct *copy_process(
>  		goto fork_out;
>  	p->flags &= ~PF_KTHREAD;
>  	if (args->kthread)
> -		p->flags |= PF_KTHREAD;
> +		p->flags |= PF_KTHREAD | PF_NO_TASKWORK;
>  	if (args->user_worker) {
>  		/*
>  		 * Mark us a user worker, and block any signal that isn't
> -- 
> 2.49.0
>
Jens Axboe April 11, 2025, 2:37 p.m. UTC | #2
On 4/11/25 7:48 AM, Christian Brauner wrote:
> Seems fine. Although it has some potential for abuse. So maybe a
> VFS_WARN_ON_ONCE() that PF_NO_TASKWORK is only used with PF_KTHREAD
> would make sense.

Can certainly add that. You'd want that before the check for
in_interrupt and PF_NO_TASKWORK? Something ala

	/* PF_NO_TASKWORK should only be used with PF_KTHREAD */
	VFS_WARN_ON_ONCE((task->flags & PF_NO_TASKWORK) && !(task->flags & PF_KTHREAD));

?

> Acked-by: Christian Brauner <brauner@kernel.org>

Thanks!
Christian Brauner April 14, 2025, 10:10 a.m. UTC | #3
On Fri, Apr 11, 2025 at 08:37:51AM -0600, Jens Axboe wrote:
> On 4/11/25 7:48 AM, Christian Brauner wrote:
> > Seems fine. Although it has some potential for abuse. So maybe a
> > VFS_WARN_ON_ONCE() that PF_NO_TASKWORK is only used with PF_KTHREAD
> > would make sense.
> 
> Can certainly add that. You'd want that before the check for
> in_interrupt and PF_NO_TASKWORK? Something ala
> 
> 	/* PF_NO_TASKWORK should only be used with PF_KTHREAD */
> 	VFS_WARN_ON_ONCE((task->flags & PF_NO_TASKWORK) && !(task->flags & PF_KTHREAD));
> 
> ?

Yeah, sounds good!
Jens Axboe April 14, 2025, 2:29 p.m. UTC | #4
On 4/14/25 4:10 AM, Christian Brauner wrote:
> On Fri, Apr 11, 2025 at 08:37:51AM -0600, Jens Axboe wrote:
>> On 4/11/25 7:48 AM, Christian Brauner wrote:
>>> Seems fine. Although it has some potential for abuse. So maybe a
>>> VFS_WARN_ON_ONCE() that PF_NO_TASKWORK is only used with PF_KTHREAD
>>> would make sense.
>>
>> Can certainly add that. You'd want that before the check for
>> in_interrupt and PF_NO_TASKWORK? Something ala
>>
>> 	/* PF_NO_TASKWORK should only be used with PF_KTHREAD */
>> 	VFS_WARN_ON_ONCE((task->flags & PF_NO_TASKWORK) && !(task->flags & PF_KTHREAD));
>>
>> ?
> 
> Yeah, sounds good!

I used the usual XOR trick for this kind of test, but placed it in the
same spot:

https://git.kernel.dk/cgit/linux/commit/?h=io_uring-exit-cancel.2&id=d5ab108781ccc2f0f013fe009a010a1f29a4785d
Mateusz Guzik April 14, 2025, 5:11 p.m. UTC | #5
On Wed, Apr 09, 2025 at 07:35:19AM -0600, Jens Axboe wrote:
> fput currently gates whether a task can run task_work on the PF_KTHREAD
> flag, which excludes kernel threads: they don't usually run task_work,
> since they never exit to userspace. As a result, the final fput done from
> a kthread is punted to a delayed work item instead of using task_work.
> 
> It's perfectly viable to have the final fput done by the kthread itself,
> as long as it will actually run the task_work. Add a PF_NO_TASKWORK flag
> which is set by default for kernel threads, and gate the task_work fput
> on that instead. This enables a kernel thread to clear the flag
> temporarily while putting files, as long as it runs its task_work
> manually.
> 
> This enables users like io_uring to ensure that, when the final fput of
> a file is done as part of ring teardown, the local task_work is run, and
> hence to know that all files have been properly put, without needing to
> resort to workqueue flushing tricks which can deadlock.
> 
> No functional changes in this patch.
> 
> Cc: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/file_table.c       | 2 +-
>  include/linux/sched.h | 2 +-
>  kernel/fork.c         | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/file_table.c b/fs/file_table.c
> index c04ed94cdc4b..e3c3dd1b820d 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -521,7 +521,7 @@ static void __fput_deferred(struct file *file)
>  		return;
>  	}
>  
> -	if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> +	if (likely(!in_interrupt() && !(task->flags & PF_NO_TASKWORK))) {
>  		init_task_work(&file->f_task_work, ____fput);
>  		if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
>  			return;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..349c993fc32b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1736,7 +1736,7 @@ extern struct pid *cad_pid;
>  						 * I am cleaning dirty pages from some other bdi. */
>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
> -#define PF__HOLE__00800000	0x00800000
> +#define PF_NO_TASKWORK		0x00800000	/* task doesn't run task_work */
>  #define PF__HOLE__01000000	0x01000000
>  #define PF__HOLE__02000000	0x02000000
>  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c4b26cd8998b..8dd0b8a5348d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2261,7 +2261,7 @@ __latent_entropy struct task_struct *copy_process(
>  		goto fork_out;
>  	p->flags &= ~PF_KTHREAD;
>  	if (args->kthread)
> -		p->flags |= PF_KTHREAD;
> +		p->flags |= PF_KTHREAD | PF_NO_TASKWORK;
>  	if (args->user_worker) {
>  		/*
>  		 * Mark us a user worker, and block any signal that isn't

I don't have comments on the semantics here, but I do have comments on
some future-proofing.

To my reading, kthreads on the stock kernel never execute task_work.

This suggests it would be nice for task_work_add() to at least WARN_ON
when called on a kthread. After all, you don't want a task_work_add
consumer adding work which will never execute.

But then for your patch to not produce any splats there would have to be
a flag blessing select kthreads as legitimate task_work consumers.

So my suggestion would be to add the WARN_ON() in task_work_add() prior
to anything in this patchset; this patch would then be extended with a
flag (PF_KTHREAD_DOES_TASK_WORK?) and the relevant io_uring threads would
get the flag.

Then the machinery which sets/unsets PF_NO_TASKWORK can assert that:
1. it operates on a kthread...
2. ...with the PF_KTHREAD_DOES_TASK_WORK flag

This is just a suggestion though.
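
For illustration, a rough sketch of what such a check might look like
(hypothetical: PF_KTHREAD_DOES_TASK_WORK is only the name proposed above,
not a flag that exists in this series):

	int task_work_add(struct task_struct *task, struct callback_head *work,
			  enum task_work_notify_mode notify)
	{
		/*
		 * Plain kthreads never return to userspace, so any work queued
		 * here would silently never run unless the thread has been
		 * explicitly marked as running task_work itself.
		 */
		WARN_ON_ONCE((task->flags & PF_KTHREAD) &&
			     !(task->flags & PF_KTHREAD_DOES_TASK_WORK));

		/* ... existing task_work_add() logic ... */
	}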
Jens Axboe April 14, 2025, 7:35 p.m. UTC | #6
On 4/14/25 11:11 AM, Mateusz Guzik wrote:
> On Wed, Apr 09, 2025 at 07:35:19AM -0600, Jens Axboe wrote:
>> fput currently gates whether a task can run task_work on the PF_KTHREAD
>> flag, which excludes kernel threads: they don't usually run task_work,
>> since they never exit to userspace. As a result, the final fput done from
>> a kthread is punted to a delayed work item instead of using task_work.
>>
>> It's perfectly viable to have the final fput done by the kthread itself,
>> as long as it will actually run the task_work. Add a PF_NO_TASKWORK flag
>> which is set by default for kernel threads, and gate the task_work fput
>> on that instead. This enables a kernel thread to clear the flag
>> temporarily while putting files, as long as it runs its task_work
>> manually.
>>
>> This enables users like io_uring to ensure that, when the final fput of
>> a file is done as part of ring teardown, the local task_work is run, and
>> hence to know that all files have been properly put, without needing to
>> resort to workqueue flushing tricks which can deadlock.
>>
>> No functional changes in this patch.
>>
>> Cc: Christian Brauner <brauner@kernel.org>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/file_table.c       | 2 +-
>>  include/linux/sched.h | 2 +-
>>  kernel/fork.c         | 2 +-
>>  3 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/file_table.c b/fs/file_table.c
>> index c04ed94cdc4b..e3c3dd1b820d 100644
>> --- a/fs/file_table.c
>> +++ b/fs/file_table.c
>> @@ -521,7 +521,7 @@ static void __fput_deferred(struct file *file)
>>  		return;
>>  	}
>>  
>> -	if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
>> +	if (likely(!in_interrupt() && !(task->flags & PF_NO_TASKWORK))) {
>>  		init_task_work(&file->f_task_work, ____fput);
>>  		if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
>>  			return;
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index f96ac1982893..349c993fc32b 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1736,7 +1736,7 @@ extern struct pid *cad_pid;
>>  						 * I am cleaning dirty pages from some other bdi. */
>>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
>> -#define PF__HOLE__00800000	0x00800000
>> +#define PF_NO_TASKWORK		0x00800000	/* task doesn't run task_work */
>>  #define PF__HOLE__01000000	0x01000000
>>  #define PF__HOLE__02000000	0x02000000
>>  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index c4b26cd8998b..8dd0b8a5348d 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -2261,7 +2261,7 @@ __latent_entropy struct task_struct *copy_process(
>>  		goto fork_out;
>>  	p->flags &= ~PF_KTHREAD;
>>  	if (args->kthread)
>> -		p->flags |= PF_KTHREAD;
>> +		p->flags |= PF_KTHREAD | PF_NO_TASKWORK;
>>  	if (args->user_worker) {
>>  		/*
>>  		 * Mark us a user worker, and block any signal that isn't
> 
> I don't have comments on the semantics here, but I do have comments on
> some future-proofing.
> 
> To my reading, kthreads on the stock kernel never execute task_work.

Correct

> This suggests it would be nice for task_work_add() to at least WARN_ON
> when called on a kthread. After all, you don't want a task_work_add
> consumer adding work which will never execute.

I don't think there's much need for that, as I'm not aware of any kernel
usage that had a bug due to that. And if you did, you'd find it pretty
quick during testing as that work would just never execute.

> But then for your patch to not produce any splats there would have to be
> a flag blessing select kthreads as legitimate task_work consumers.

This patchset very much adds a specific flag for that, PF_NO_TASKWORK,
and kernel threads have it set by default. It just separates the "do I
run task_work" flag from PF_KTHREAD. So yes you could add:

WARN_ON_ONCE(task->flags & PF_NO_TASKWORK);

to task_work_add(), but I'm not really convinced it'd be super useful.

Patch

diff --git a/fs/file_table.c b/fs/file_table.c
index c04ed94cdc4b..e3c3dd1b820d 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -521,7 +521,7 @@  static void __fput_deferred(struct file *file)
 		return;
 	}
 
-	if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+	if (likely(!in_interrupt() && !(task->flags & PF_NO_TASKWORK))) {
 		init_task_work(&file->f_task_work, ____fput);
 		if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
 			return;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..349c993fc32b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1736,7 +1736,7 @@  extern struct pid *cad_pid;
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
-#define PF__HOLE__00800000	0x00800000
+#define PF_NO_TASKWORK		0x00800000	/* task doesn't run task_work */
 #define PF__HOLE__01000000	0x01000000
 #define PF__HOLE__02000000	0x02000000
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b..8dd0b8a5348d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2261,7 +2261,7 @@  __latent_entropy struct task_struct *copy_process(
 		goto fork_out;
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
-		p->flags |= PF_KTHREAD;
+		p->flags |= PF_KTHREAD | PF_NO_TASKWORK;
 	if (args->user_worker) {
 		/*
 		 * Mark us a user worker, and block any signal that isn't