diff mbox series

[v7] io_uring: Statistics of the true utilization of sq threads.

Message ID 20240118073032.15015-1-xiaobing.li@samsung.com (mailing list archive)
State New
Headers show
Series [v7] io_uring: Statistics of the true utilization of sq threads. | expand

Commit Message

Xiaobing Li Jan. 18, 2024, 7:30 a.m. UTC
Count the running time and actual IO processing time of the sqpoll
thread, and output the statistical data to fdinfo.

Variable description:
"work_time" in the code represents the sum of the jiffies of the sq
thread actually processing IO, that is, how many milliseconds it
actually takes to process IO. "total_time" represents the total time
that the sq thread has elapsed from the beginning of the loop to the
current time point, that is, how many milliseconds it has spent in
total.

The test tool is fio, and its parameters are as follows:
[global]
ioengine=io_uring
direct=1
group_reporting
bs=128k
norandommap=1
randrepeat=0
refill_buffers
ramp_time=30s
time_based
runtime=1m
clocksource=clock_gettime
overwrite=1
log_avg_msec=1000
numjobs=1

[disk0]
filename=/dev/nvme0n1
rw=read
iodepth=16
hipri
sqthread_poll=1

---
The test results are as follows:
Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
SqMask: 0x3
SqHead: 3197153
SqTail: 3197153
CachedSqHead:   3197153
SqThread:       9231
SqThreadCpu:    11
SqTotalTime:    18099614us
SqWorkTime:     16748316us

---
The test results corresponding to different iodepths are as follows:
|-----------|-------|-------|-------|------|-------|
|   iodepth |   1   |   4   |   8   |  16  |  64   |
|-----------|-------|-------|-------|------|-------|
|utilization| 2.9%  | 8.8%  | 10.9% | 92.9%| 84.4% |
|-----------|-------|-------|-------|------|-------|
|    idle   | 97.1% | 91.2% | 89.1% | 7.1% | 15.6% |
|-----------|-------|-------|-------|------|-------|

changes:
v7:
 - Get the total time of the sqpoll thread through getrusage
 - The working time of the sqpoll thread is obtained through ktime_get()

v6:
 - Replace the percentages in the fdinfo output with the actual running
time and the time actually processing IO

v5:
 - list the changes in each iteration.

v4:
 - Resubmit the patch based on removing sq->lock

v3:
 - output actual working time as a percentage of total time
 - detailed description of the meaning of each variable
 - added test results

v2:
 - output the total statistical time and work time to fdinfo

v1:
 - initial version
 - Statistics of total time and work time

Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com>
---
 io_uring/fdinfo.c | 8 ++++++++
 io_uring/sqpoll.c | 7 +++++++
 io_uring/sqpoll.h | 1 +
 3 files changed, 16 insertions(+)

Comments

Jens Axboe Jan. 18, 2024, 7:34 p.m. UTC | #1
On 1/18/24 12:30 AM, Xiaobing Li wrote:
> diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
> index 976e9500f651..24a7452ed98e 100644
> --- a/io_uring/fdinfo.c
> +++ b/io_uring/fdinfo.c
> @@ -64,6 +64,7 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
>  	unsigned int sq_shift = 0;
>  	unsigned int sq_entries, cq_entries;
>  	int sq_pid = -1, sq_cpu = -1;
> +	long long sq_total_time = 0, sq_work_time = 0;
>  	bool has_lock;
>  	unsigned int i;
>  
> @@ -147,10 +148,17 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
>  
>  		sq_pid = sq->task_pid;
>  		sq_cpu = sq->sq_cpu;
> +		struct rusage r;
> +
> +		getrusage(sq->thread, RUSAGE_SELF, &r);
> +		sq_total_time = r.ru_stime.tv_sec * 1000000 + r.ru_stime.tv_usec;
> +		sq_work_time = sq->work_time;
>  	}

I guess getrusage() is fine here, though I would probably just grab it
directly.

> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
> index 65b5dbe3c850..f3e9fda72400 100644
> --- a/io_uring/sqpoll.c
> +++ b/io_uring/sqpoll.c
> @@ -251,6 +251,9 @@ static int io_sq_thread(void *data)
>  		}
>  
>  		cap_entries = !list_is_singular(&sqd->ctx_list);
> +		ktime_t start, diff;
> +
> +		start = ktime_get();
>  		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
>  			int ret = __io_sq_thread(ctx, cap_entries);

But why on earth is this part then not doing getrusage() as well?


> diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h
> index 8df37e8c9149..c14c00240443 100644
> --- a/io_uring/sqpoll.h
> +++ b/io_uring/sqpoll.h
> @@ -16,6 +16,7 @@ struct io_sq_data {
>  	pid_t			task_pid;
>  	pid_t			task_tgid;
>  
> +	long long			work_time;
>  	unsigned long		state;
>  	struct completion	exited;
>  };

Probably just make that an u64.

As Pavel mentioned, I think we really need to consider if fdinfo is the
appropriate API for this. It's fine if you're running stuff directly and
you're just curious, but it's a very cumbersome API in general as you
need to know the pid of the task holding the ring, the fd of the ring,
and then you can get it as a textual description. If this is something
that is deemed useful, would it not make more sense to make it
programatically available in addition, or even exclusively?
Xiaobing Li Jan. 29, 2024, 7:18 a.m. UTC | #2
On 1/18/24 19:34, Jens Axboe wrote:
>> diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h
>> index 8df37e8c9149..c14c00240443 100644
>> --- a/io_uring/sqpoll.h
>> +++ b/io_uring/sqpoll.h
>> @@ -16,6 +16,7 @@ struct io_sq_data {
>>  	pid_t			task_pid;
>>  	pid_t			task_tgid;
>>  
>> +	long long			work_time;
>>  	unsigned long		state;
>>  	struct completion	exited;
>>  };
>
>Probably just make that an u64.
>
>As Pavel mentioned, I think we really need to consider if fdinfo is the
>appropriate API for this. It's fine if you're running stuff directly and
>you're just curious, but it's a very cumbersome API in general as you
>need to know the pid of the task holding the ring, the fd of the ring,
>and then you can get it as a textual description. If this is something
>that is deemed useful, would it not make more sense to make it
>programatically available in addition, or even exclusively?

Hi, Jens and Pavel
sorry for the late reply.

I've tried some other methods, but overall, I haven't found a more suitable 
method than fdinfo.
If you think it is troublesome to obtain the PID,  then I can provide
 a shell script to output the total_time and work_time of all sqpoll threads 
 to the terminal, so that we do not have to manually obtain the PID of each 
 thread (the script can be placed in tools/ include/io_uring).

eg:

PID    WorkTime(us)   TotalTime(us)   COMMAND
9330   1106578        2215321         iou-sqp-9329
9454   1510658        1715321         iou-sqp-9453
9478   165785         223219          iou-sqp-9477
9587   106578         153217          iou-sqp-9586

What do you think of this solution?
Jens Axboe Jan. 29, 2024, 2 p.m. UTC | #3
On 1/29/24 12:18 AM, Xiaobing Li wrote:
> On 1/18/24 19:34, Jens Axboe wrote:
>>> diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h
>>> index 8df37e8c9149..c14c00240443 100644
>>> --- a/io_uring/sqpoll.h
>>> +++ b/io_uring/sqpoll.h
>>> @@ -16,6 +16,7 @@ struct io_sq_data {
>>>  	pid_t			task_pid;
>>>  	pid_t			task_tgid;
>>>  
>>> +	long long			work_time;
>>>  	unsigned long		state;
>>>  	struct completion	exited;
>>>  };
>>
>> Probably just make that an u64.
>>
>> As Pavel mentioned, I think we really need to consider if fdinfo is the
>> appropriate API for this. It's fine if you're running stuff directly and
>> you're just curious, but it's a very cumbersome API in general as you
>> need to know the pid of the task holding the ring, the fd of the ring,
>> and then you can get it as a textual description. If this is something
>> that is deemed useful, would it not make more sense to make it
>> programatically available in addition, or even exclusively?
> 
> Hi, Jens and Pavel
> sorry for the late reply.
> 
> I've tried some other methods, but overall, I haven't found a more suitable 
> method than fdinfo.
> If you think it is troublesome to obtain the PID,  then I can provide
>  a shell script to output the total_time and work_time of all sqpoll threads 
>  to the terminal, so that we do not have to manually obtain the PID of each 
>  thread (the script can be placed in tools/ include/io_uring).
> 
> eg:
> 
> PID    WorkTime(us)   TotalTime(us)   COMMAND
> 9330   1106578        2215321         iou-sqp-9329
> 9454   1510658        1715321         iou-sqp-9453
> 9478   165785         223219          iou-sqp-9477
> 9587   106578         153217          iou-sqp-9586
> 
> What do you think of this solution?

I don't think it's a great interface, but at the same time, I don't feel
that strongly about it and perhaps bundling a script that outputs the
above in liburing would be Good Enough. I'm a bit reluctant to add a
stats API via io_uring_register() just for this.

So maybe spin a v8 with the s/long long/u64 change and include your
script as well?
Pavel Begunkov Jan. 29, 2024, 3:01 p.m. UTC | #4
On 1/29/24 07:18, Xiaobing Li wrote:
> On 1/18/24 19:34, Jens Axboe wrote:
>>> diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h
>>> index 8df37e8c9149..c14c00240443 100644
>>> --- a/io_uring/sqpoll.h
>>> +++ b/io_uring/sqpoll.h
>>> @@ -16,6 +16,7 @@ struct io_sq_data {
>>>   	pid_t			task_pid;
>>>   	pid_t			task_tgid;
>>>   
>>> +	long long			work_time;
>>>   	unsigned long		state;
>>>   	struct completion	exited;
>>>   };
>>
>> Probably just make that an u64.
>>
>> As Pavel mentioned, I think we really need to consider if fdinfo is the
>> appropriate API for this. It's fine if you're running stuff directly and
>> you're just curious, but it's a very cumbersome API in general as you
>> need to know the pid of the task holding the ring, the fd of the ring,
>> and then you can get it as a textual description. If this is something
>> that is deemed useful, would it not make more sense to make it
>> programatically available in addition, or even exclusively?
> 
> Hi, Jens and Pavel
> sorry for the late reply.
> 
> I've tried some other methods, but overall, I haven't found a more suitable
> method than fdinfo.

I wouldn't mind if it's fdinfo only for now, that can be changed later
if needed. I'm more concerned that reading fdinfo and then parsing it
is incompatible with the word performance, which you mentioned in the
context of using 1 vs 2 syscalls to get the stats.

That can be left to be resolved later, however. Let's just be clear
in docs that stats could be 0, which means the feature is not
working/disabled.

Another question I raised in my reply (v6 thread), why it's using
ktime_get(), which same as jiffies but more precise, instead of a
task time?


> If you think it is troublesome to obtain the PID,  then I can provide

I missed the context, where do we need to know PIDs?


>   a shell script to output the total_time and work_time of all sqpoll threads
>   to the terminal, so that we do not have to manually obtain the PID of each
>   thread (the script can be placed in tools/ include/io_uring).
> 
> eg:
> 
> PID    WorkTime(us)   TotalTime(us)   COMMAND
> 9330   1106578        2215321         iou-sqp-9329
> 9454   1510658        1715321         iou-sqp-9453
> 9478   165785         223219          iou-sqp-9477
> 9587   106578         153217          iou-sqp-9586
> 
> What do you think of this solution?
Xiaobing Li Jan. 30, 2024, 5:47 a.m. UTC | #5
On 1/29/24 15:01, Pavel Begunkov wrote:
>On 1/29/24 07:18, Xiaobing Li wrote:
>> On 1/18/24 19:34, Jens Axboe wrote:
>>>> diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h
>>>> index 8df37e8c9149..c14c00240443 100644
>>>> --- a/io_uring/sqpoll.h
>>>> +++ b/io_uring/sqpoll.h
>>>> @@ -16,6 +16,7 @@ struct io_sq_data {
>>>>   	pid_t			task_pid;
>>>>   	pid_t			task_tgid;
>>>>   
>>>> +	long long			work_time;
>>>>   	unsigned long		state;
>>>>   	struct completion	exited;
>>>>   };
>>>
>>> Probably just make that an u64.
>>>
>>> As Pavel mentioned, I think we really need to consider if fdinfo is the
>>> appropriate API for this. It's fine if you're running stuff directly and
>>> you're just curious, but it's a very cumbersome API in general as you
>>> need to know the pid of the task holding the ring, the fd of the ring,
>>> and then you can get it as a textual description. If this is something
>>> that is deemed useful, would it not make more sense to make it
>>> programatically available in addition, or even exclusively?
>> 
>> Hi, Jens and Pavel
>> sorry for the late reply.
>> 
>> I've tried some other methods, but overall, I haven't found a more suitable
>> method than fdinfo.
>
>I wouldn't mind if it's fdinfo only for now, that can be changed later
>if needed. I'm more concerned that reading fdinfo and then parsing it
>is incompatible with the word performance, which you mentioned in the
>context of using 1 vs 2 syscalls to get the stats.
>
>That can be left to be resolved later, however. Let's just be clear
>in docs that stats could be 0, which means the feature is not
>working/disabled.
>
>Another question I raised in my reply (v6 thread), why it's using
>ktime_get(), which same as jiffies but more precise, instead of a
>task time?

Sorry, I forgot to reply to you.
I was thinking wrong. you are right,  we can use "getrusage" to statistics 
the work_time of sqpoll in the ring.

>
>
>> If you think it is troublesome to obtain the PID,  then I can provide
>
>I missed the context, where do we need to know PIDs?

Since obtaining the fdinfo content of sqpoll requires finding the corresponding 
PID first, I guess Jens thinks it is troublesome to manually obtain the PID of 
each sqpoll thread when there are many sqpoll threads.
Therefore, I want to write a script that can automatically output all sqpoll
 statistics.
 
--
Xiaobing Li
Xiaobing Li Jan. 30, 2024, 6:57 a.m. UTC | #6
On 1/29/24 14:00, Jens Axboe wrote:
>On 1/29/24 12:18 AM, Xiaobing Li wrote:
>> On 1/18/24 19:34, Jens Axboe wrote:
>>>> diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h
>>>> index 8df37e8c9149..c14c00240443 100644
>>>> --- a/io_uring/sqpoll.h
>>>> +++ b/io_uring/sqpoll.h
>>>> @@ -16,6 +16,7 @@ struct io_sq_data {
>>>>  	pid_t			task_pid;
>>>>  	pid_t			task_tgid;
>>>>  
>>>> +	long long			work_time;
>>>>  	unsigned long		state;
>>>>  	struct completion	exited;
>>>>  };
>>>
>>> Probably just make that an u64.
>>>
>>> As Pavel mentioned, I think we really need to consider if fdinfo is the
>>> appropriate API for this. It's fine if you're running stuff directly and
>>> you're just curious, but it's a very cumbersome API in general as you
>>> need to know the pid of the task holding the ring, the fd of the ring,
>>> and then you can get it as a textual description. If this is something
>>> that is deemed useful, would it not make more sense to make it
>>> programatically available in addition, or even exclusively?
>> 
>> Hi, Jens and Pavel
>> sorry for the late reply.
>> 
>> I've tried some other methods, but overall, I haven't found a more suitable 
>> method than fdinfo.
>> If you think it is troublesome to obtain the PID,  then I can provide
>>  a shell script to output the total_time and work_time of all sqpoll threads 
>>  to the terminal, so that we do not have to manually obtain the PID of each 
>>  thread (the script can be placed in tools/ include/io_uring).
>> 
>> eg:
>> 
>> PID    WorkTime(us)   TotalTime(us)   COMMAND
>> 9330   1106578        2215321         iou-sqp-9329
>> 9454   1510658        1715321         iou-sqp-9453
>> 9478   165785         223219          iou-sqp-9477
>> 9587   106578         153217          iou-sqp-9586
>> 
>> What do you think of this solution?
>
>I don't think it's a great interface, but at the same time, I don't feel
>that strongly about it and perhaps bundling a script that outputs the
>above in liburing would be Good Enough. I'm a bit reluctant to add a
>stats API via io_uring_register() just for this.
>
>So maybe spin a v8 with the s/long long/u64 change and include your
>script as well?
 
ok, I'll send out a v8.

--
Xiaobing Li
diff mbox series

Patch

diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index 976e9500f651..24a7452ed98e 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -64,6 +64,7 @@  __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
 	unsigned int sq_shift = 0;
 	unsigned int sq_entries, cq_entries;
 	int sq_pid = -1, sq_cpu = -1;
+	long long sq_total_time = 0, sq_work_time = 0;
 	bool has_lock;
 	unsigned int i;
 
@@ -147,10 +148,17 @@  __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
 
 		sq_pid = sq->task_pid;
 		sq_cpu = sq->sq_cpu;
+		struct rusage r;
+
+		getrusage(sq->thread, RUSAGE_SELF, &r);
+		sq_total_time = r.ru_stime.tv_sec * 1000000 + r.ru_stime.tv_usec;
+		sq_work_time = sq->work_time;
 	}
 
 	seq_printf(m, "SqThread:\t%d\n", sq_pid);
 	seq_printf(m, "SqThreadCpu:\t%d\n", sq_cpu);
+	seq_printf(m, "SqTotalTime:\t%lldus\n", sq_total_time);
+	seq_printf(m, "SqWorkTime:\t%lldus\n", sq_work_time);
 	seq_printf(m, "UserFiles:\t%u\n", ctx->nr_user_files);
 	for (i = 0; has_lock && i < ctx->nr_user_files; i++) {
 		struct file *f = io_file_from_index(&ctx->file_table, i);
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 65b5dbe3c850..f3e9fda72400 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -251,6 +251,9 @@  static int io_sq_thread(void *data)
 		}
 
 		cap_entries = !list_is_singular(&sqd->ctx_list);
+		ktime_t start, diff;
+
+		start = ktime_get();
 		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
 			int ret = __io_sq_thread(ctx, cap_entries);
 
@@ -260,6 +263,10 @@  static int io_sq_thread(void *data)
 		if (io_run_task_work())
 			sqt_spin = true;
 
+		diff = ktime_sub(ktime_get(), start);
+		if (sqt_spin == true)
+			sqd->work_time += ktime_to_us(diff);
+
 		if (sqt_spin || !time_after(jiffies, timeout)) {
 			if (sqt_spin)
 				timeout = jiffies + sqd->sq_thread_idle;
diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h
index 8df37e8c9149..c14c00240443 100644
--- a/io_uring/sqpoll.h
+++ b/io_uring/sqpoll.h
@@ -16,6 +16,7 @@  struct io_sq_data {
 	pid_t			task_pid;
 	pid_t			task_tgid;
 
+	long long			work_time;
 	unsigned long		state;
 	struct completion	exited;
 };