Message ID | ffbbe596-c6a9-42ed-9156-e6d5c21eca9b@kernel.dk (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v3] io_uring/fdinfo: remove need for sqpoll lock for thread/pid retrieval | expand |
On 11/15/23 2:36 AM, Jens Axboe wrote: > if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) { > struct io_sq_data *sq = ctx->sq_data; > >- if (mutex_trylock(&sq->lock)) { >- if (sq->thread) { >- sq_pid = task_pid_nr(sq->thread); >- sq_cpu = task_cpu(sq->thread); >- } >- mutex_unlock(&sq->lock); >- } >+ sq_pid = sq->task_pid; >+ sq_cpu = sq->sq_cpu; > } There are two problems: 1.The output of SqThread is inaccurate. What is actually recorded is the PID of the parent process. 2. Sometimes it can output, sometimes it outputs -1. The test results are as follows: Every 0.5s: cat /proc/9572/fdinfo/6 | grep Sq SqMask: 0x3 SqHead: 6765744 SqTail: 6765744 CachedSqHead: 6765744 SqThread: -1 SqThreadCpu: -1 SqBusy: 0% ------------------------------------------- Every 0.5s: cat /proc/9572/fdinfo/6 | grep Sq SqMask: 0x3 SqHead: 7348727 SqTail: 7348728 CachedSqHead: 7348728 SqThread: 9571 SqThreadCpu: 174 SqBusy: 95%
On 11/14/23 11:10 PM, Xiaobing Li wrote: > On 11/15/23 2:36 AM, Jens Axboe wrote: >> if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) { >> struct io_sq_data *sq = ctx->sq_data; >> >> - if (mutex_trylock(&sq->lock)) { >> - if (sq->thread) { >> - sq_pid = task_pid_nr(sq->thread); >> - sq_cpu = task_cpu(sq->thread); >> - } >> - mutex_unlock(&sq->lock); >> - } >> + sq_pid = sq->task_pid; >> + sq_cpu = sq->sq_cpu; >> } > > There are two problems: > 1.The output of SqThread is inaccurate. What is actually recorded is > the PID of the parent process. Doh yes, we need to reset this at the start of the thread, post assigning task_comm. I'll send out a v4 today. > 2. Sometimes it can output, sometimes it outputs -1. > > The test results are as follows: > Every 0.5s: cat /proc/9572/fdinfo/6 | grep Sq > SqMask: 0x3 > SqHead: 6765744 > SqTail: 6765744 > CachedSqHead: 6765744 > SqThread: -1 > SqThreadCpu: -1 > SqBusy: 0% > ------------------------------------------- > Every 0.5s: cat /proc/9572/fdinfo/6 | grep Sq > SqMask: 0x3 > SqHead: 7348727 > SqTail: 7348728 > CachedSqHead: 7348728 > SqThread: 9571 > SqThreadCpu: 174 > SqBusy: 95% Right, this is due to the uring_lock. We got rid of the main regression, which was the new trylock for the sqd->lock, but the old one remains. We can fix this as well for sqpoll info, but it's not a regression from past releases, it's always been like that. Pavel and I discussed it yesterday, and the easy solution is to make io_sq_data be under RCU protection. But that requires this patch first, so we don't have to fiddle with the sqpoll task itself. I can try and hack up the patch if you want to test it, it'd be on top of this one and for the next kernel release rather than 6.7.
On 11/15/23 6:42 AM, Jens Axboe wrote: >> 2. Sometimes it can output, sometimes it outputs -1. >> >> The test results are as follows: >> Every 0.5s: cat /proc/9572/fdinfo/6 | grep Sq >> SqMask: 0x3 >> SqHead: 6765744 >> SqTail: 6765744 >> CachedSqHead: 6765744 >> SqThread: -1 >> SqThreadCpu: -1 >> SqBusy: 0% >> ------------------------------------------- >> Every 0.5s: cat /proc/9572/fdinfo/6 | grep Sq >> SqMask: 0x3 >> SqHead: 7348727 >> SqTail: 7348728 >> CachedSqHead: 7348728 >> SqThread: 9571 >> SqThreadCpu: 174 >> SqBusy: 95% > > Right, this is due to the uring_lock. We got rid of the main > regression, which was the new trylock for the sqd->lock, but the old > one remains. We can fix this as well for sqpoll info, but it's not a > regression from past releases, it's always been like that. > > Pavel and I discussed it yesterday, and the easy solution is to make > io_sq_data be under RCU protection. But that requires this patch > first, so we don't have to fiddle with the sqpoll task itself. I can > try and hack up the patch if you want to test it, it'd be on top of > this one and for the next kernel release rather than 6.7. Something like this, totally untested. diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c index 976e9500f651..434a21a6b653 100644 --- a/io_uring/fdinfo.c +++ b/io_uring/fdinfo.c @@ -142,11 +142,16 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f) */ has_lock = mutex_trylock(&ctx->uring_lock); - if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) { - struct io_sq_data *sq = ctx->sq_data; - - sq_pid = sq->task_pid; - sq_cpu = sq->sq_cpu; + if (ctx->flags & IORING_SETUP_SQPOLL) { + struct io_sq_data *sq; + + rcu_read_lock(); + sq = READ_ONCE(ctx->sq_data); + if (sq) { + sq_pid = sq->task_pid; + sq_cpu = sq->sq_cpu; + } + rcu_read_unlock(); } seq_printf(m, "SqThread:\t%d\n", sq_pid); diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c index 65b5dbe3c850..583c76945cdf 100644 --- a/io_uring/sqpoll.c +++ b/io_uring/sqpoll.c @@ -70,7 +70,7 @@ void io_put_sq_data(struct io_sq_data *sqd) WARN_ON_ONCE(atomic_read(&sqd->park_pending)); io_sq_thread_stop(sqd); - kfree(sqd); + kfree_rcu(sqd, rcu); } } @@ -313,7 +313,7 @@ static int io_sq_thread(void *data) } io_uring_cancel_generic(true, sqd); - sqd->thread = NULL; + WRITE_ONCE(sqd->thread, NULL); list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) atomic_or(IORING_SQ_NEED_WAKEUP, &ctx->rings->sq_flags); io_run_task_work(); @@ -411,7 +411,7 @@ __cold int io_sq_offload_create(struct io_ring_ctx *ctx, goto err_sqpoll; } - sqd->thread = tsk; + WRITE_ONCE(sqd->thread, tsk); ret = io_uring_alloc_task_context(tsk, ctx); wake_up_new_task(tsk); if (ret) diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h index 8df37e8c9149..0cf0c5833a27 100644 --- a/io_uring/sqpoll.h +++ b/io_uring/sqpoll.h @@ -18,6 +18,8 @@ struct io_sq_data { unsigned long state; struct completion exited; + + struct rcu_head rcu; }; int io_sq_offload_create(struct io_ring_ctx *ctx, struct io_uring_params *p);
On 11/15/23 6:42 AM, Jens Axboe wrote: > */ > has_lock = mutex_trylock(&ctx->uring_lock); > >- if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) { >- struct io_sq_data *sq = ctx->sq_data; >- >- sq_pid = sq->task_pid; >- sq_cpu = sq->sq_cpu; >+ if (ctx->flags & IORING_SETUP_SQPOLL) { >+ struct io_sq_data *sq; >+ >+ rcu_read_lock(); >+ sq = READ_ONCE(ctx->sq_data); >+ if (sq) { >+ sq_pid = sq->task_pid; >+ sq_cpu = sq->sq_cpu; >+ } >+ rcu_read_unlock(); > } > > seq_printf(m, "SqThread:\t%d\n", sq_pid); >diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c >index 65b5dbe3c850..583c76945cdf 100644 >--- a/io_uring/sqpoll.c >+++ b/io_uring/sqpoll.c >@@ -70,7 +70,7 @@ void io_put_sq_data(struct io_sq_data *sqd) > WARN_ON_ONCE(atomic_read(&sqd->park_pending)); > > io_sq_thread_stop(sqd); >- kfree(sqd); >+ kfree_rcu(sqd, rcu); > } > } > >@@ -313,7 +313,7 @@ static int io_sq_thread(void *data) > } > > io_uring_cancel_generic(true, sqd); >- sqd->thread = NULL; >+ WRITE_ONCE(sqd->thread, NULL); > list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) > atomic_or(IORING_SQ_NEED_WAKEUP, &ctx->rings->sq_flags); > io_run_task_work(); >@@ -411,7 +411,7 @@ __cold int io_sq_offload_create(struct io_ring_ctx *ctx, > goto err_sqpoll; > } > >- sqd->thread = tsk; >+ WRITE_ONCE(sqd->thread, tsk); > ret = io_uring_alloc_task_context(tsk, ctx); > wake_up_new_task(tsk); > if (ret) >diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h >index 8df37e8c9149..0cf0c5833a27 100644 >--- a/io_uring/sqpoll.h >+++ b/io_uring/sqpoll.h >@@ -18,6 +18,8 @@ struct io_sq_data { > > unsigned long state; > struct completion exited; >+ >+ struct rcu_head rcu; > }; > > int io_sq_offload_create(struct io_ring_ctx *ctx, struct io_uring_params *p); I tested this and it worked after adding RCU lock. It consistently outputs correct results. The results of a simple test are as follows: Every 0.5s: cat /proc/10212/fdinfo/6 | grep Sq SqMask: 0x3 SqHead: 17422716 SqTail: 17422716 CachedSqHead: 17422716 SqThread: 10212 SqThreadCpu: 73 SqBusy: 97% ------------------------------------------------------------- But the name of the sq thread is "iou-sqp-" + "the PID of its parent process": PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 10211 root 20 0 184408 8192 0 R 99.9 0.0 4:01.42 fio 10212 root 20 0 184408 8192 0 R 99.9 0.0 4:01.48 iou-sqp-10211 Is this the originally desired effect?
On 11/17/23 8:19 PM, Xiaobing Li wrote: > On 11/15/23 6:42 AM, Jens Axboe wrote: >> */ >> has_lock = mutex_trylock(&ctx->uring_lock); >> >> - if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) { >> - struct io_sq_data *sq = ctx->sq_data; >> - >> - sq_pid = sq->task_pid; >> - sq_cpu = sq->sq_cpu; >> + if (ctx->flags & IORING_SETUP_SQPOLL) { >> + struct io_sq_data *sq; >> + >> + rcu_read_lock(); >> + sq = READ_ONCE(ctx->sq_data); >> + if (sq) { >> + sq_pid = sq->task_pid; >> + sq_cpu = sq->sq_cpu; >> + } >> + rcu_read_unlock(); >> } >> >> seq_printf(m, "SqThread:\t%d\n", sq_pid); >> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c >> index 65b5dbe3c850..583c76945cdf 100644 >> --- a/io_uring/sqpoll.c >> +++ b/io_uring/sqpoll.c >> @@ -70,7 +70,7 @@ void io_put_sq_data(struct io_sq_data *sqd) >> WARN_ON_ONCE(atomic_read(&sqd->park_pending)); >> >> io_sq_thread_stop(sqd); >> - kfree(sqd); >> + kfree_rcu(sqd, rcu); >> } >> } >> >> @@ -313,7 +313,7 @@ static int io_sq_thread(void *data) >> } >> >> io_uring_cancel_generic(true, sqd); >> - sqd->thread = NULL; >> + WRITE_ONCE(sqd->thread, NULL); >> list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) >> atomic_or(IORING_SQ_NEED_WAKEUP, &ctx->rings->sq_flags); >> io_run_task_work(); >> @@ -411,7 +411,7 @@ __cold int io_sq_offload_create(struct io_ring_ctx *ctx, >> goto err_sqpoll; >> } >> >> - sqd->thread = tsk; >> + WRITE_ONCE(sqd->thread, tsk); >> ret = io_uring_alloc_task_context(tsk, ctx); >> wake_up_new_task(tsk); >> if (ret) >> diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h >> index 8df37e8c9149..0cf0c5833a27 100644 >> --- a/io_uring/sqpoll.h >> +++ b/io_uring/sqpoll.h >> @@ -18,6 +18,8 @@ struct io_sq_data { >> >> unsigned long state; >> struct completion exited; >> + >> + struct rcu_head rcu; >> }; >> >> int io_sq_offload_create(struct io_ring_ctx *ctx, struct io_uring_params *p); > > I tested this and it worked after adding RCU lock. > It consistently outputs correct results. > > The results of a simple test are as follows: > Every 0.5s: cat /proc/10212/fdinfo/6 | grep Sq > SqMask: 0x3 > SqHead: 17422716 > SqTail: 17422716 > CachedSqHead: 17422716 > SqThread: 10212 > SqThreadCpu: 73 > SqBusy: 97% > ------------------------------------------------------------- > But the name of the sq thread is "iou-sqp-" + "the PID of its parent process": > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 10211 root 20 0 184408 8192 0 R 99.9 0.0 4:01.42 fio > 10212 root 20 0 184408 8192 0 R 99.9 0.0 4:01.48 iou-sqp-10211 > Is this the originally desired effect? That's how it has always been, it's the sqpoll thread belonging to that parent.
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c index f04a43044d91..976e9500f651 100644 --- a/io_uring/fdinfo.c +++ b/io_uring/fdinfo.c @@ -145,13 +145,8 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f) if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) { struct io_sq_data *sq = ctx->sq_data; - if (mutex_trylock(&sq->lock)) { - if (sq->thread) { - sq_pid = task_pid_nr(sq->thread); - sq_cpu = task_cpu(sq->thread); - } - mutex_unlock(&sq->lock); - } + sq_pid = sq->task_pid; + sq_cpu = sq->sq_cpu; } seq_printf(m, "SqThread:\t%d\n", sq_pid); diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c index bd6c2c7959a5..a604cb6c5272 100644 --- a/io_uring/sqpoll.c +++ b/io_uring/sqpoll.c @@ -214,6 +214,7 @@ static bool io_sqd_handle_event(struct io_sq_data *sqd) did_sig = get_signal(&ksig); cond_resched(); mutex_lock(&sqd->lock); + sqd->sq_cpu = raw_smp_processor_id(); } return did_sig || test_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state); } @@ -229,10 +230,12 @@ static int io_sq_thread(void *data) snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid); set_task_comm(current, buf); - if (sqd->sq_cpu != -1) + if (sqd->sq_cpu != -1) { set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu)); - else + } else { set_cpus_allowed_ptr(current, cpu_online_mask); + sqd->sq_cpu = raw_smp_processor_id(); + } mutex_lock(&sqd->lock); while (1) { @@ -261,6 +264,7 @@ static int io_sq_thread(void *data) mutex_unlock(&sqd->lock); cond_resched(); mutex_lock(&sqd->lock); + sqd->sq_cpu = raw_smp_processor_id(); } continue; } @@ -294,6 +298,7 @@ static int io_sq_thread(void *data) mutex_unlock(&sqd->lock); schedule(); mutex_lock(&sqd->lock); + sqd->sq_cpu = raw_smp_processor_id(); } list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) atomic_andnot(IORING_SQ_NEED_WAKEUP,
A previous commit added a trylock for getting the SQPOLL thread info via fdinfo, but this introduced a regression where we often fail to get it if the thread is busy. For that case, we end up not printing the current CPU and PID info. Rather than rely on this lock, just print the pid we already stored in the io_sq_data struct, and ensure we update the current CPU every time we've slept or potentially rescheduled. The latter won't potentially be 100% accurate, but that wasn't the case before either as the task can get migrated at any time unless it has been pinned at creation time. We retain keeping the io_sq_data dereference inside the ctx->uring_lock, as it has always been, as destruction of the thread and data happen below that. We could make this RCU safe, but there's little point in doing that. With this, we always print the last valid information we had, rather than have spurious outputs with missing information. Fixes: 7644b1a1c9a7 ("io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid") Signed-off-by: Jens Axboe <axboe@kernel.dk> --- v3: just use raw_smp_processor_id(), no need to query the task as it's the current one. also update it whenever we've re-acquired the lock after schedule or preemption, as it's cheap enough now to do so.