
[v3,2/2] io_uring/io-wq: inherit cpuset of cgroup in io worker

Message ID 20240910171157.166423-3-felix.moessbauer@siemens.com (mailing list archive)
State New
Series io_uring/io-wq: respect cgroup cpusets

Commit Message

Felix Moessbauer Sept. 10, 2024, 5:11 p.m. UTC
The io worker threads are userland threads that simply never return to
userland. As such, they are also assigned to a cgroup (the cgroup of
the creating task).

When creating a new io worker, this worker should inherit the cpuset
of the cgroup.

Fixes: da64d6db3bd3 ("io_uring: One wqe per wq")
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
---
 io_uring/io-wq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Waiman Long Sept. 10, 2024, 5:42 p.m. UTC | #1
On 9/10/24 13:11, Felix Moessbauer wrote:
> The io worker threads are userland threads that simply never return to
> userland. As such, they are also assigned to a cgroup (the cgroup of
> the creating task).

The io-wq task is not actually assigned to a cgroup. To belong to a 
cgroup, its pid has to be present in the cgroup.procs file of the
corresponding cgroup, which is not the case here. My understanding is 
that you are just restricting the CPU affinity to follow the cpuset of 
the corresponding user task that creates it. The CPU affinity (cpumask) 
is just one of the many resources controlled by a cgroup. That probably 
needs to be clarified.
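
For reference, the interface the patch switches to is declared in
include/linux/cpuset.h and fills the given mask with the CPUs permitted
by the task's cpuset:

	/* Fills @mask with the CPUs allowed by @p's cpuset. */
	void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);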

Besides cpumask, the cpuset controller also controls the node mask of 
the memory nodes allowed.
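
As a rough sketch (illustrative only, not part of this patch), the
memory side could be inherited analogously via cpuset_mems_allowed(),
assuming it is applied from the worker's own context, e.g. with a
hypothetical helper like:

	#include <linux/cpuset.h>
	#include <linux/nodemask.h>

	/* Hypothetical helper: restrict the calling io worker's memory
	 * allocations to the nodes allowed by the cpuset of the task
	 * that created the wq. */
	static void io_wq_worker_inherit_mems(struct task_struct *creator)
	{
		nodemask_t mems = cpuset_mems_allowed(creator);

		/* set_mems_allowed() applies the nodemask to current,
		 * i.e. the worker thread itself. */
		set_mems_allowed(mems);
	}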

Cheers,
Longman

>
> When creating a new io worker, this worker should inherit the cpuset
> of the cgroup.
>
> Fixes: da64d6db3bd3 ("io_uring: One wqe per wq")
> Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
> ---
>   io_uring/io-wq.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index c7055a8895d7..a38f36b68060 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -1168,7 +1168,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
>   
>   	if (!alloc_cpumask_var(&wq->cpu_mask, GFP_KERNEL))
>   		goto err;
> -	cpumask_copy(wq->cpu_mask, cpu_possible_mask);
> +	cpuset_cpus_allowed(data->task, wq->cpu_mask);
>   	wq->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
>   	wq->acct[IO_WQ_ACCT_UNBOUND].max_workers =
>   				task_rlimit(current, RLIMIT_NPROC);
Felix Moessbauer Sept. 11, 2024, 7:02 a.m. UTC | #2
On Tue, 2024-09-10 at 13:42 -0400, Waiman Long wrote:
> 
> On 9/10/24 13:11, Felix Moessbauer wrote:
> > The io worker threads are userland threads that simply never return
> > to userland. As such, they are also assigned to a cgroup (the
> > cgroup of the creating task).
> 
> The io-wq task is not actually assigned to a cgroup. To belong to a 
> cgroup, its pid has to be present in the cgroup.procs file of the
> corresponding cgroup, which is not the case here.

Hi, thanks for jumping in. As said, I'm not too familiar with the
internals of the io worker threads. Nonetheless, the kernel presents
the cgroup assignment quite consistently. This, however, contradicts
your statement above. Example:

pid     tid
648460  648460  SCHED_OTHER   20  S    0  0-1  ./test/wq-aff.t
648460  648461  SCHED_OTHER   20  S    1  1    iou-sqp-648460
648460  648462  SCHED_OTHER   20  S    0  0    iou-wrk-648461

When I now check cgroup.procs, I only see 648460, which is expected as
this is the process (with its main thread). Checking
cgroup.threads shows all three tids.

When checking the other way round, I get the same information:
$ cat /proc/648460/task/648461/cgroup
0::/user.slice/user-1000.slice/session-1.scope
$ cat /proc/648460/task/648462/cgroup
0::/user.slice/user-1000.slice/session-1.scope

Now I'm wondering whether it is just presented incorrectly, or whether
these tasks indeed belong to the mentioned cgroup.

> My understanding is
> that you are just restricting the CPU affinity to follow the cpuset
> of 
> the corresponding user task that creates it. The CPU affinity
> (cpumask) 
> is just one of the many resources controlled by a cgroup. That
> probably 
> needs to be clarified.

That's clear. Looking at the bigger picture, I want to ensure that the
io workers do not break out of the cgroup limits (I called it "ambient"
before, similar to capabilities), because that would break the
isolation assumption. In our case, we are mostly interested in not
leaving the cpuset, as we use it to partition the system into realtime
and non-realtime parts.

> 
> Besides cpumask, the cpuset controller also controls the node mask of
> the memory nodes allowed.

Yes, and that is especially important as some memory can be "closer" to
the I/O devices than other memory.

Best regards,
Felix

Patch

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index c7055a8895d7..a38f36b68060 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -1168,7 +1168,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 
 	if (!alloc_cpumask_var(&wq->cpu_mask, GFP_KERNEL))
 		goto err;
-	cpumask_copy(wq->cpu_mask, cpu_possible_mask);
+	cpuset_cpus_allowed(data->task, wq->cpu_mask);
 	wq->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
 	wq->acct[IO_WQ_ACCT_UNBOUND].max_workers =
 				task_rlimit(current, RLIMIT_NPROC);
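
For illustration, a minimal userspace check along the lines of the
wq-aff.t test mentioned in the thread (a hedged sketch: it assumes
liburing is installed and that the process was started inside a
cpuset-restricted cgroup; the file name and the IOSQE_ASYNC flag are
just one way to force an io-wq worker to appear):

	#include <liburing.h>
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		char buf[4096];
		int fd;

		if (io_uring_queue_init(8, &ring, 0) < 0)
			return 1;

		fd = open("/etc/hostname", O_RDONLY);
		if (fd < 0)
			return 1;

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
		/* IOSQE_ASYNC punts the request to io-wq, so an
		 * iou-wrk-* thread is created. */
		sqe->flags |= IOSQE_ASYNC;
		io_uring_submit(&ring);

		if (io_uring_wait_cqe(&ring, &cqe) == 0)
			io_uring_cqe_seen(&ring, cqe);

		/* Keep the process alive; with the patch applied,
		 * grep Cpus_allowed_list /proc/<pid>/task/<tid>/status
		 * for the iou-wrk thread should show the cpuset's CPUs
		 * rather than all possible CPUs. */
		pause();
		return 0;
	}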