diff mbox series

[v2] io_uring: add IORING_ENTER_NO_IOWAIT to not set in_iowait

Message ID 20240816223640.1140763-1-dw@davidwei.uk (mailing list archive)
State New
Headers show
Series [v2] io_uring: add IORING_ENTER_NO_IOWAIT to not set in_iowait | expand

Commit Message

David Wei Aug. 16, 2024, 10:36 p.m. UTC
io_uring sets current->in_iowait when waiting for completions, which
achieves two things:

1. Proper accounting of the time as iowait time
2. Enable cpufreq optimisations, setting SCHED_CPUFREQ_IOWAIT on the rq

For block IO this makes sense as high iowait can be indicative of
issues. But for network IO especially recv, the recv side does not
control when the completions happen.

Some user tooling attributes iowait time as CPU utilisation i.e. not
idle, so high iowait time looks like high CPU util even though the task
is not scheduled and the CPU is free to run other tasks. When doing
network IO with e.g. the batch completion feature, the CPU may appear to
have high utilisation.

This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
enter. If set, then current->in_iowait is not set. By default this flag
is not set to maintain existing behaviour i.e. in_iowait is always set.
This is to prevent waiting for completions being accounted as CPU
utilisation.

Not setting in_iowait does mean that we also lose cpufreq optimisations
above because in_iowait semantics couples 1 and 2 together. Eventually
we will untangle the two so the optimisations can be enabled
independently of the accounting.

IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
support. This will be used by liburing to check for this feature.

Signed-off-by: David Wei <dw@davidwei.uk>
---
v2:
 - squash patches into one
 - move no_iowait in struct io_wait_queue to the end
 - always set iowq.no_iowait

---
 include/uapi/linux/io_uring.h | 2 ++
 io_uring/io_uring.c           | 7 ++++---
 io_uring/io_uring.h           | 1 +
 3 files changed, 7 insertions(+), 3 deletions(-)

Comments

Jens Axboe Aug. 16, 2024, 10:49 p.m. UTC | #1
On 8/16/24 4:36 PM, David Wei wrote:
> io_uring sets current->in_iowait when waiting for completions, which
> achieves two things:
> 
> 1. Proper accounting of the time as iowait time
> 2. Enable cpufreq optimisations, setting SCHED_CPUFREQ_IOWAIT on the rq
> 
> For block IO this makes sense as high iowait can be indicative of
> issues. But for network IO especially recv, the recv side does not
> control when the completions happen.
> 
> Some user tooling attributes iowait time as CPU utilisation i.e. not
> idle, so high iowait time looks like high CPU util even though the task
> is not scheduled and the CPU is free to run other tasks. When doing
> network IO with e.g. the batch completion feature, the CPU may appear to
> have high utilisation.
> 
> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
> enter. If set, then current->in_iowait is not set. By default this flag
> is not set to maintain existing behaviour i.e. in_iowait is always set.
> This is to prevent waiting for completions being accounted as CPU
> utilisation.
> 
> Not setting in_iowait does mean that we also lose cpufreq optimisations
> above because in_iowait semantics couples 1 and 2 together. Eventually
> we will untangle the two so the optimisations can be enabled
> independently of the accounting.
> 
> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
> support. This will be used by liburing to check for this feature.

LGTM
Jeff Moyer Aug. 17, 2024, 1:23 a.m. UTC | #2
Hi, David,

David Wei <dw@davidwei.uk> writes:

> io_uring sets current->in_iowait when waiting for completions, which
> achieves two things:
>
> 1. Proper accounting of the time as iowait time
> 2. Enable cpufreq optimisations, setting SCHED_CPUFREQ_IOWAIT on the rq
>
> For block IO this makes sense as high iowait can be indicative of
> issues.

It also let's you know that the system isn't truly idle.  IOW, it would
be doing some work if it didn't have to wait for I/O.  This was the
reason the metric was added (admins being confused about why their
system was showing up idle).

> But for network IO especially recv, the recv side does not control
> when the completions happen.
>
> Some user tooling attributes iowait time as CPU utilisation i.e. not

What user tooling are you talking about?  If it shows iowait as busy
time, the tooling is broken.  Please see my last mail on the subject:
  https://lore.kernel.org/io-uring/x49cz0hxdfa.fsf@segfault.boston.devel.redhat.com/

> idle, so high iowait time looks like high CPU util even though the task
> is not scheduled and the CPU is free to run other tasks. When doing
> network IO with e.g. the batch completion feature, the CPU may appear to
> have high utilisation.

Again, iowait is idle time.

> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
> enter. If set, then current->in_iowait is not set. By default this flag
> is not set to maintain existing behaviour i.e. in_iowait is always set.
> This is to prevent waiting for completions being accounted as CPU
> utilisation.
>
> Not setting in_iowait does mean that we also lose cpufreq optimisations
> above because in_iowait semantics couples 1 and 2 together. Eventually
> we will untangle the two so the optimisations can be enabled
> independently of the accounting.
>
> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
> support. This will be used by liburing to check for this feature.

If I receive a problem report where iowait time isn't accurate, I now
have to somehow figure out if an application is setting this flag.  This
sounds like a support headache, and I do wonder what the benefit is.
From what you've written, the justification for the patch is that some
userspace tooling misinterprets iowait.  Shouldn't we just fix that?

It may be that certain (all?) network functions, like recv, should not
be accounted as iowait.  However, I don't think the onus should be on
applications to tell the kernel about that--the kernel should just
figure that out on its own.

Am I alone in these opinions?

Cheers,
Jeff


>
> Signed-off-by: David Wei <dw@davidwei.uk>
> ---
> v2:
>  - squash patches into one
>  - move no_iowait in struct io_wait_queue to the end
>  - always set iowq.no_iowait
>
> ---
>  include/uapi/linux/io_uring.h | 2 ++
>  io_uring/io_uring.c           | 7 ++++---
>  io_uring/io_uring.h           | 1 +
>  3 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 48c440edf674..3a94afa8665e 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -508,6 +508,7 @@ struct io_cqring_offsets {
>  #define IORING_ENTER_EXT_ARG		(1U << 3)
>  #define IORING_ENTER_REGISTERED_RING	(1U << 4)
>  #define IORING_ENTER_ABS_TIMER		(1U << 5)
> +#define IORING_ENTER_NO_IOWAIT		(1U << 6)
>  
>  /*
>   * Passed in for io_uring_setup(2). Copied back with updated info on success
> @@ -543,6 +544,7 @@ struct io_uring_params {
>  #define IORING_FEAT_LINKED_FILE		(1U << 12)
>  #define IORING_FEAT_REG_REG_RING	(1U << 13)
>  #define IORING_FEAT_RECVSEND_BUNDLE	(1U << 14)
> +#define IORING_FEAT_IOWAIT_TOGGLE	(1U << 15)
>  
>  /*
>   * io_uring_register(2) opcodes and arguments
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 20229e72b65c..5e75672525df 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2372,7 +2372,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>  	 * can take into account that the task is waiting for IO - turns out
>  	 * to be important for low QD IO.
>  	 */
> -	if (current_pending_io())
> +	if (!iowq->no_iowait && current_pending_io())
>  		current->in_iowait = 1;
>  	ret = 0;
>  	if (iowq->timeout == KTIME_MAX)
> @@ -2414,6 +2414,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
>  	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
>  	iowq.cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
>  	iowq.timeout = KTIME_MAX;
> +	iowq.no_iowait = flags & IORING_ENTER_NO_IOWAIT;
>  
>  	if (uts) {
>  		struct timespec64 ts;
> @@ -3155,7 +3156,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>  	if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP |
>  			       IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG |
>  			       IORING_ENTER_REGISTERED_RING |
> -			       IORING_ENTER_ABS_TIMER)))
> +			       IORING_ENTER_ABS_TIMER | IORING_ENTER_NO_IOWAIT)))
>  		return -EINVAL;
>  
>  	/*
> @@ -3539,7 +3540,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
>  			IORING_FEAT_EXT_ARG | IORING_FEAT_NATIVE_WORKERS |
>  			IORING_FEAT_RSRC_TAGS | IORING_FEAT_CQE_SKIP |
>  			IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING |
> -			IORING_FEAT_RECVSEND_BUNDLE;
> +			IORING_FEAT_RECVSEND_BUNDLE | IORING_FEAT_IOWAIT_TOGGLE;
>  
>  	if (copy_to_user(params, p, sizeof(*p))) {
>  		ret = -EFAULT;
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index 9935819f12b7..426079a966ac 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -46,6 +46,7 @@ struct io_wait_queue {
>  	ktime_t napi_busy_poll_dt;
>  	bool napi_prefer_busy_poll;
>  #endif
> +	bool no_iowait;
>  };
>  
>  static inline bool io_should_wake(struct io_wait_queue *iowq)
Pavel Begunkov Aug. 17, 2024, 7:44 p.m. UTC | #3
On 8/16/24 23:36, David Wei wrote:
> io_uring sets current->in_iowait when waiting for completions, which
> achieves two things:
> 
> 1. Proper accounting of the time as iowait time
> 2. Enable cpufreq optimisations, setting SCHED_CPUFREQ_IOWAIT on the rq

"achieve" is not the right word, nobody wanted 1. and it's not
"proper accounting" but rather an unfortunate side effect.

> For block IO this makes sense as high iowait can be indicative of
> issues. But for network IO especially recv, the recv side does not
> control when the completions happen.
> 
> Some user tooling attributes iowait time as CPU utilisation i.e. not
> idle, so high iowait time looks like high CPU util even though the task
> is not scheduled and the CPU is free to run other tasks. When doing
> network IO with e.g. the batch completion feature, the CPU may appear to
> have high utilisation.

How "batch completion" came into the picture? It elevates
iowait for any net apps, we have enough reports about it.


> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
> enter. If set, then current->in_iowait is not set. By default this flag

A worthwhile change but for _completely_ different reasons. So, first,
it's v3, not v2, considering the patchset from a couple month ago. And
since in essence nothing has changed, I can only repeat same points I
made back then.

The description reads like the flag's purpose is to change accounting,
and I'm vividly oppose any user exposed (per ring) toggle doing that.
We don't want the overhead, it's a very confusing feature, and not even
that helpful. iowait is monitored not by the app itself but by someone
else outside, likely by a different person, and even before trying to
make sense of numbers the monitor would need to learn first whether
_any_ program uses io_uring and what flags the application writer
decided to pass, even more fun when io_uring is used via a 3rd party
library.

Exactly same patches could make sense if you flip the description
and say "in_iowait is good for perfomance in some cases but
degrades power consumption for others, so here is a way to tune
performance", (just take Jamal's numbers). And that would need to
clearly state (including man) that the iowait statistic is a side
effect of it, we don't give it a thought, and the time accounting
aspect may and hopefully will change in the future.

Jens, can you remind what happened with separating iowait stats
vs the optimisation? I believed you sent some patches

> is not set to maintain existing behaviour i.e. in_iowait is always set.
> This is to prevent waiting for completions being accounted as CPU
> utilisation.

For accounting, it's more reasonable to keep it disabled by
default, so we stop getting millions complaints per day about
high iowait.

> Not setting in_iowait does mean that we also lose cpufreq optimisations
> above because in_iowait semantics couples 1 and 2 together. Eventually
> we will untangle the two so the optimisations can be enabled
> independently of the accounting.
> 
> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
> support. This will be used by liburing to check for this feature.
> Signed-off-by: David Wei <dw@davidwei.uk>
> ---
> v2:
>   - squash patches into one
>   - move no_iowait in struct io_wait_queue to the end
>   - always set iowq.no_iowait
> 
> ---
>   include/uapi/linux/io_uring.h | 2 ++
>   io_uring/io_uring.c           | 7 ++++---
>   io_uring/io_uring.h           | 1 +
>   3 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 48c440edf674..3a94afa8665e 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -508,6 +508,7 @@ struct io_cqring_offsets {
>   #define IORING_ENTER_EXT_ARG		(1U << 3)
>   #define IORING_ENTER_REGISTERED_RING	(1U << 4)
>   #define IORING_ENTER_ABS_TIMER		(1U << 5)
> +#define IORING_ENTER_NO_IOWAIT		(1U << 6)

Just curious, why did we switch from a register opcode to an
ENTER flag?
Jens Axboe Aug. 17, 2024, 8:20 p.m. UTC | #4
On 8/17/24 1:44 PM, Pavel Begunkov wrote:
>> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
>> enter. If set, then current->in_iowait is not set. By default this flag
> 
> A worthwhile change but for _completely_ different reasons. So, first,
> it's v3, not v2, considering the patchset from a couple month ago. And
> since in essence nothing has changed, I can only repeat same points I
> made back then.
> 
> The description reads like the flag's purpose is to change accounting,
> and I'm vividly oppose any user exposed (per ring) toggle doing that.
> We don't want the overhead, it's a very confusing feature, and not even

Come on, there's no overhead associated with this, in practice.

It's really simple for this stuff - the freq boost is useful (and
needed) for some workloads, and the iowait accounting is never useful
for anything but (currently) comes as an unfortunate side effect of the
former. But even with those two separated, there are still going to be
cases where you want to control when it happens.

> that helpful. iowait is monitored not by the app itself but by someone
> else outside, likely by a different person, and even before trying to
> make sense of numbers the monitor would need to learn first whether
> _any_ program uses io_uring and what flags the application writer
> decided to pass, even more fun when io_uring is used via a 3rd party
> library.
> 
> Exactly same patches could make sense if you flip the description
> and say "in_iowait is good for perfomance in some cases but
> degrades power consumption for others, so here is a way to tune
> performance", (just take Jamal's numbers). And that would need to
> clearly state (including man) that the iowait statistic is a side
> effect of it, we don't give it a thought, and the time accounting
> aspect may and hopefully will change in the future.

Don't disagree with that, the boosting is the primary function here,
iowait accounting is just an odd relic and side effect.

> Jens, can you remind what happened with separating iowait stats
> vs the optimisation? I believed you sent some patches

Yep I still have them, and I'll dust them off and resend them. But it
doesn't really change the need for this at all, other than perhaps
wanting to rename the actual flag as it would not be about iowait at
all, it'd just be about power consumption.

Last cut was here:

https://git.kernel.dk/cgit/linux/log/?h=iowait.2

just need simple rebasing.

>> is not set to maintain existing behaviour i.e. in_iowait is always set.
>> This is to prevent waiting for completions being accounted as CPU
>> utilisation.
> 
> For accounting, it's more reasonable to keep it disabled by
> default, so we stop getting millions complaints per day about
> high iowait.

Agree, it should just go away.

>> Not setting in_iowait does mean that we also lose cpufreq optimisations
>> above because in_iowait semantics couples 1 and 2 together. Eventually
>> we will untangle the two so the optimisations can be enabled
>> independently of the accounting.
>>
>> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
>> support. This will be used by liburing to check for this feature.
>> Signed-off-by: David Wei <dw@davidwei.uk>
>> ---
>> v2:
>>   - squash patches into one
>>   - move no_iowait in struct io_wait_queue to the end
>>   - always set iowq.no_iowait
>>
>> ---
>>   include/uapi/linux/io_uring.h | 2 ++
>>   io_uring/io_uring.c           | 7 ++++---
>>   io_uring/io_uring.h           | 1 +
>>   3 files changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 48c440edf674..3a94afa8665e 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -508,6 +508,7 @@ struct io_cqring_offsets {
>>   #define IORING_ENTER_EXT_ARG        (1U << 3)
>>   #define IORING_ENTER_REGISTERED_RING    (1U << 4)
>>   #define IORING_ENTER_ABS_TIMER        (1U << 5)
>> +#define IORING_ENTER_NO_IOWAIT        (1U << 6)
> 
> Just curious, why did we switch from a register opcode to an
> ENTER flag?

That was my suggestion, I think it's more flexible in that you can
disable the boost if you know you're doing longer waits or slower IO,
and enable it when frequencies go up. I don't particularly like the
static register approach for this.
Pavel Begunkov Aug. 17, 2024, 9:05 p.m. UTC | #5
On 8/17/24 21:20, Jens Axboe wrote:
> On 8/17/24 1:44 PM, Pavel Begunkov wrote:
>>> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
>>> enter. If set, then current->in_iowait is not set. By default this flag
>>
>> A worthwhile change but for _completely_ different reasons. So, first,
>> it's v3, not v2, considering the patchset from a couple month ago. And
>> since in essence nothing has changed, I can only repeat same points I
>> made back then.
>>
>> The description reads like the flag's purpose is to change accounting,
>> and I'm vividly oppose any user exposed (per ring) toggle doing that.
>> We don't want the overhead, it's a very confusing feature, and not even
> 
> Come on, there's no overhead associated with this, in practice.

Minor, right, ~ 1 "if" or ccmov, but that's for a feature that nobody
really cares about and arguably even harmful. That's assuming there
will be another flag for performance tuning in the future, at least
the description reads like that.

> It's really simple for this stuff - the freq boost is useful (and
> needed) for some workloads, and the iowait accounting is never useful
> for anything but (currently) comes as an unfortunate side effect of the
> former. But even with those two separated, there are still going to be
> cases where you want to control when it happens.

You can imagine such cases, but in reality I doubt it. If we
disable the stat part, nobody would notice as nobody cared for
last 3-4 years before in_iowait was added.

>> that helpful. iowait is monitored not by the app itself but by someone
>> else outside, likely by a different person, and even before trying to
>> make sense of numbers the monitor would need to learn first whether
>> _any_ program uses io_uring and what flags the application writer
>> decided to pass, even more fun when io_uring is used via a 3rd party
>> library.
>>
>> Exactly same patches could make sense if you flip the description
>> and say "in_iowait is good for perfomance in some cases but
>> degrades power consumption for others, so here is a way to tune
>> performance", (just take Jamal's numbers). And that would need to
>> clearly state (including man) that the iowait statistic is a side
>> effect of it, we don't give it a thought, and the time accounting
>> aspect may and hopefully will change in the future.
> 
> Don't disagree with that, the boosting is the primary function here,
> iowait accounting is just an odd relic and side effect.
> 
>> Jens, can you remind what happened with separating iowait stats
>> vs the optimisation? I believed you sent some patches
> 
> Yep I still have them, and I'll dust them off and resend them. But it
> doesn't really change the need for this at all, other than perhaps
> wanting to rename the actual flag as it would not be about iowait at
> all, it'd just be about power consumption.
> 
> Last cut was here:
> 
> https://git.kernel.dk/cgit/linux/log/?h=iowait.2
> 
> just need simple rebasing.

IMHO we should try to merge it first. And if there would be
some friction, we can get back and take this patch, with
additional note describing about iowait stat side effects.

>>> is not set to maintain existing behaviour i.e. in_iowait is always set.
>>> This is to prevent waiting for completions being accounted as CPU
>>> utilisation.
>>
>> For accounting, it's more reasonable to keep it disabled by
>> default, so we stop getting millions complaints per day about
>> high iowait.
> 
> Agree, it should just go away.
> 
>>> Not setting in_iowait does mean that we also lose cpufreq optimisations
>>> above because in_iowait semantics couples 1 and 2 together. Eventually
>>> we will untangle the two so the optimisations can be enabled
>>> independently of the accounting.
>>>
>>> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
>>> support. This will be used by liburing to check for this feature.
>>> Signed-off-by: David Wei <dw@davidwei.uk>
>>> ---
>>> v2:
>>>    - squash patches into one
>>>    - move no_iowait in struct io_wait_queue to the end
>>>    - always set iowq.no_iowait
>>>
>>> ---
>>>    include/uapi/linux/io_uring.h | 2 ++
>>>    io_uring/io_uring.c           | 7 ++++---
>>>    io_uring/io_uring.h           | 1 +
>>>    3 files changed, 7 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>> index 48c440edf674..3a94afa8665e 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -508,6 +508,7 @@ struct io_cqring_offsets {
>>>    #define IORING_ENTER_EXT_ARG        (1U << 3)
>>>    #define IORING_ENTER_REGISTERED_RING    (1U << 4)
>>>    #define IORING_ENTER_ABS_TIMER        (1U << 5)
>>> +#define IORING_ENTER_NO_IOWAIT        (1U << 6)
>>
>> Just curious, why did we switch from a register opcode to an
>> ENTER flag?
> 
> That was my suggestion, I think it's more flexible in that you can
> disable the boost if you know you're doing longer waits or slower IO,
> and enable it when frequencies go up. I don't particularly like the
> static register approach for this.

Makes sense, it's easier for the app to count how many
request of what type is inflight.

The name might also be confusing. We need an explanation when
it could be useful, and name it accordingly. DEEP/SHALLOW_WAIT?
Do you remember how cpufreq accounts for it?
Jens Axboe Aug. 17, 2024, 9:09 p.m. UTC | #6
On 8/17/24 3:05 PM, Pavel Begunkov wrote:
> On 8/17/24 21:20, Jens Axboe wrote:
>> On 8/17/24 1:44 PM, Pavel Begunkov wrote:
>>>> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
>>>> enter. If set, then current->in_iowait is not set. By default this flag
>>>
>>> A worthwhile change but for _completely_ different reasons. So, first,
>>> it's v3, not v2, considering the patchset from a couple month ago. And
>>> since in essence nothing has changed, I can only repeat same points I
>>> made back then.
>>>
>>> The description reads like the flag's purpose is to change accounting,
>>> and I'm vividly oppose any user exposed (per ring) toggle doing that.
>>> We don't want the overhead, it's a very confusing feature, and not even
>>
>> Come on, there's no overhead associated with this, in practice.
> 
> Minor, right, ~ 1 "if" or ccmov, but that's for a feature that nobody
> really cares about and arguably even harmful. That's assuming there
> will be another flag for performance tuning in the future, at least
> the description reads like that.

You literally brought up the case where people care yourself in the
reply, the one from Jamal. And the one I said is probably not the common
case, yet I think we should cater to it as it very well could be legit,
just in the tiny minority of cases.

>> It's really simple for this stuff - the freq boost is useful (and
>> needed) for some workloads, and the iowait accounting is never useful
>> for anything but (currently) comes as an unfortunate side effect of the
>> former. But even with those two separated, there are still going to be
>> cases where you want to control when it happens.
> 
> You can imagine such cases, but in reality I doubt it. If we
> disable the stat part, nobody would notice as nobody cared for
> last 3-4 years before in_iowait was added.

That would be ideal. You're saying Jamal's complaint was purely iowait
based? Because it looked like power concerns to me... If it's just
iowait, then they just need to stop looking at that, that's pretty
simple.

>>> that helpful. iowait is monitored not by the app itself but by someone
>>> else outside, likely by a different person, and even before trying to
>>> make sense of numbers the monitor would need to learn first whether
>>> _any_ program uses io_uring and what flags the application writer
>>> decided to pass, even more fun when io_uring is used via a 3rd party
>>> library.
>>>
>>> Exactly same patches could make sense if you flip the description
>>> and say "in_iowait is good for perfomance in some cases but
>>> degrades power consumption for others, so here is a way to tune
>>> performance", (just take Jamal's numbers). And that would need to
>>> clearly state (including man) that the iowait statistic is a side
>>> effect of it, we don't give it a thought, and the time accounting
>>> aspect may and hopefully will change in the future.
>>
>> Don't disagree with that, the boosting is the primary function here,
>> iowait accounting is just an odd relic and side effect.
>>
>>> Jens, can you remind what happened with separating iowait stats
>>> vs the optimisation? I believed you sent some patches
>>
>> Yep I still have them, and I'll dust them off and resend them. But it
>> doesn't really change the need for this at all, other than perhaps
>> wanting to rename the actual flag as it would not be about iowait at
>> all, it'd just be about power consumption.
>>
>> Last cut was here:
>>
>> https://git.kernel.dk/cgit/linux/log/?h=iowait.2
>>
>> just need simple rebasing.
> 
> IMHO we should try to merge it first. And if there would be
> some friction, we can get back and take this patch, with
> additional note describing about iowait stat side effects.

I did just rebase on top of sched/core and sent out a new version. I
still think it makes sense regardless of this patch or not, only
difference would be what you call the flag.

It's here:

https://lore.kernel.org/lkml/20240817204639.132794-1-axboe@kernel.dk/T/#mab3638659a54193927351ebdc5e345278637bc41

>>>> is not set to maintain existing behaviour i.e. in_iowait is always set.
>>>> This is to prevent waiting for completions being accounted as CPU
>>>> utilisation.
>>>
>>> For accounting, it's more reasonable to keep it disabled by
>>> default, so we stop getting millions complaints per day about
>>> high iowait.
>>
>> Agree, it should just go away.
>>
>>>> Not setting in_iowait does mean that we also lose cpufreq optimisations
>>>> above because in_iowait semantics couples 1 and 2 together. Eventually
>>>> we will untangle the two so the optimisations can be enabled
>>>> independently of the accounting.
>>>>
>>>> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
>>>> support. This will be used by liburing to check for this feature.
>>>> Signed-off-by: David Wei <dw@davidwei.uk>
>>>> ---
>>>> v2:
>>>>    - squash patches into one
>>>>    - move no_iowait in struct io_wait_queue to the end
>>>>    - always set iowq.no_iowait
>>>>
>>>> ---
>>>>    include/uapi/linux/io_uring.h | 2 ++
>>>>    io_uring/io_uring.c           | 7 ++++---
>>>>    io_uring/io_uring.h           | 1 +
>>>>    3 files changed, 7 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>> index 48c440edf674..3a94afa8665e 100644
>>>> --- a/include/uapi/linux/io_uring.h
>>>> +++ b/include/uapi/linux/io_uring.h
>>>> @@ -508,6 +508,7 @@ struct io_cqring_offsets {
>>>>    #define IORING_ENTER_EXT_ARG        (1U << 3)
>>>>    #define IORING_ENTER_REGISTERED_RING    (1U << 4)
>>>>    #define IORING_ENTER_ABS_TIMER        (1U << 5)
>>>> +#define IORING_ENTER_NO_IOWAIT        (1U << 6)
>>>
>>> Just curious, why did we switch from a register opcode to an
>>> ENTER flag?
>>
>> That was my suggestion, I think it's more flexible in that you can
>> disable the boost if you know you're doing longer waits or slower IO,
>> and enable it when frequencies go up. I don't particularly like the
>> static register approach for this.
> 
> Makes sense, it's easier for the app to count how many
> request of what type is inflight.

Exactly

> The name might also be confusing. We need an explanation when
> it could be useful, and name it accordingly. DEEP/SHALLOW_WAIT?
> Do you remember how cpufreq accounts for it?

I don't remember how it accounts for it, and was just pondering that
with the above reply. Because if it just decays the sleep state, then
you could just use it generically. If it stays high regardless of how
long you wait, then it could be a power issue. Not on servers really
(well a bit, depending on boosting), but more so on desktop apps.
Laptops tend to be pretty power conservative!
Pavel Begunkov Aug. 17, 2024, 10:04 p.m. UTC | #7
On 8/17/24 22:09, Jens Axboe wrote:
> On 8/17/24 3:05 PM, Pavel Begunkov wrote:
>> On 8/17/24 21:20, Jens Axboe wrote:
>>> On 8/17/24 1:44 PM, Pavel Begunkov wrote:
>>>>> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
>>>>> enter. If set, then current->in_iowait is not set. By default this flag
>>>>
>>>> A worthwhile change but for _completely_ different reasons. So, first,
>>>> it's v3, not v2, considering the patchset from a couple month ago. And
>>>> since in essence nothing has changed, I can only repeat same points I
>>>> made back then.
>>>>
>>>> The description reads like the flag's purpose is to change accounting,
>>>> and I'm vividly oppose any user exposed (per ring) toggle doing that.
>>>> We don't want the overhead, it's a very confusing feature, and not even
>>>
>>> Come on, there's no overhead associated with this, in practice.
>>
>> Minor, right, ~ 1 "if" or ccmov, but that's for a feature that nobody
>> really cares about and arguably even harmful. That's assuming there
>> will be another flag for performance tuning in the future, at least
>> the description reads like that.
> 
> You literally brought up the case where people care yourself in the
> reply, the one from Jamal. And the one I said is probably not the common

Not really. Yes, by chance high iowait helped to narrow it to the
in_iowait io_uring patch, but that's rather a weird side effect,
not what iowait was "created for", i.e. system is idling because
storage waiting takes too long.

And that "use case" for iowait directly linked to cpufreq, so
if it still counts, then we shouldn't be separating stats from
cpufreq at all.

> case, yet I think we should cater to it as it very well could be legit,
> just in the tiny minority of cases.

I explained why it's a confusing feature. We can make up some niche
case (with enough of imagination we can justify basically anything),
but I explained why IMHO accounting flag (let's forget about
cpufreq) would have net negative effect. A sysctl knob would be
much more reasonable, but I don't think it's needed at all.


>>> It's really simple for this stuff - the freq boost is useful (and
>>> needed) for some workloads, and the iowait accounting is never useful
>>> for anything but (currently) comes as an unfortunate side effect of the
>>> former. But even with those two separated, there are still going to be
>>> cases where you want to control when it happens.
>>
>> You can imagine such cases, but in reality I doubt it. If we
>> disable the stat part, nobody would notice as nobody cared for
>> last 3-4 years before in_iowait was added.
> 
> That would be ideal. You're saying Jamal's complaint was purely iowait
> based? Because it looked like power concerns to me... If it's just
> iowait, then they just need to stop looking at that, that's pretty
> simple.

Power consumption, and then, in search of what's wrong, it was
correlated to high iowait as well as difference in C state stats.


>>>> that helpful. iowait is monitored not by the app itself but by someone
>>>> else outside, likely by a different person, and even before trying to
>>>> make sense of numbers the monitor would need to learn first whether
>>>> _any_ program uses io_uring and what flags the application writer
>>>> decided to pass, even more fun when io_uring is used via a 3rd party
>>>> library.
>>>>
>>>> Exactly same patches could make sense if you flip the description
>>>> and say "in_iowait is good for perfomance in some cases but
>>>> degrades power consumption for others, so here is a way to tune
>>>> performance", (just take Jamal's numbers). And that would need to
>>>> clearly state (including man) that the iowait statistic is a side
>>>> effect of it, we don't give it a thought, and the time accounting
>>>> aspect may and hopefully will change in the future.
>>>
>>> Don't disagree with that, the boosting is the primary function here,
>>> iowait accounting is just an odd relic and side effect.
>>>
>>>> Jens, can you remind what happened with separating iowait stats
>>>> vs the optimisation? I believed you sent some patches
>>>
>>> Yep I still have them, and I'll dust them off and resend them. But it
>>> doesn't really change the need for this at all, other than perhaps
>>> wanting to rename the actual flag as it would not be about iowait at
>>> all, it'd just be about power consumption.
>>>
>>> Last cut was here:
>>>
>>> https://git.kernel.dk/cgit/linux/log/?h=iowait.2
>>>
>>> just need simple rebasing.
>>
>> IMHO we should try to merge it first. And if there would be
>> some friction, we can get back and take this patch, with
>> additional note describing about iowait stat side effects.
> 
> I did just rebase on top of sched/core and sent out a new version. I
> still think it makes sense regardless of this patch or not, only
> difference would be what you call the flag.

There shouldn't be any difference in how it's called or how
it's recommended to be used, but maybe I don't understand
what you mean. As I see it, the difference in a blob of
text in the man mentioning the iowait stat side effect.

> It's here:
> 
> https://lore.kernel.org/lkml/20240817204639.132794-1-axboe@kernel.dk/T/#mab3638659a54193927351ebdc5e345278637bc41

Perfect


>>>>> is not set to maintain existing behaviour i.e. in_iowait is always set.
>>>>> This is to prevent waiting for completions being accounted as CPU
>>>>> utilisation.
>>>>
>>>> For accounting, it's more reasonable to keep it disabled by
>>>> default, so we stop getting millions complaints per day about
>>>> high iowait.
>>>
>>> Agree, it should just go away.
>>>
>>>>> Not setting in_iowait does mean that we also lose cpufreq optimisations
>>>>> above because in_iowait semantics couples 1 and 2 together. Eventually
>>>>> we will untangle the two so the optimisations can be enabled
>>>>> independently of the accounting.
>>>>>
>>>>> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
>>>>> support. This will be used by liburing to check for this feature.
>>>>> Signed-off-by: David Wei <dw@davidwei.uk>
>>>>> ---
>>>>> v2:
>>>>>     - squash patches into one
>>>>>     - move no_iowait in struct io_wait_queue to the end
>>>>>     - always set iowq.no_iowait
>>>>>
>>>>> ---
>>>>>     include/uapi/linux/io_uring.h | 2 ++
>>>>>     io_uring/io_uring.c           | 7 ++++---
>>>>>     io_uring/io_uring.h           | 1 +
>>>>>     3 files changed, 7 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>>> index 48c440edf674..3a94afa8665e 100644
>>>>> --- a/include/uapi/linux/io_uring.h
>>>>> +++ b/include/uapi/linux/io_uring.h
>>>>> @@ -508,6 +508,7 @@ struct io_cqring_offsets {
>>>>>     #define IORING_ENTER_EXT_ARG        (1U << 3)
>>>>>     #define IORING_ENTER_REGISTERED_RING    (1U << 4)
>>>>>     #define IORING_ENTER_ABS_TIMER        (1U << 5)
>>>>> +#define IORING_ENTER_NO_IOWAIT        (1U << 6)
>>>>
>>>> Just curious, why did we switch from a register opcode to an
>>>> ENTER flag?
>>>
>>> That was my suggestion, I think it's more flexible in that you can
>>> disable the boost if you know you're doing longer waits or slower IO,
>>> and enable it when frequencies go up. I don't particularly like the
>>> static register approach for this.
>>
>> Makes sense, it's easier for the app to count how many
>> request of what type is inflight.
> 
> Exactly
> 
>> The name might also be confusing. We need an explanation when
>> it could be useful, and name it accordingly. DEEP/SHALLOW_WAIT?
>> Do you remember how cpufreq accounts for it?
> 
> I don't remember how it accounts for it, and was just pondering that
> with the above reply. Because if it just decays the sleep state, then
> you could just use it generically. If it stays high regardless of how
> long you wait, then it could be a power issue. Not on servers really
> (well a bit, depending on boosting), but more so on desktop apps.
> Laptops tend to be pretty power conservative!

{SHORT,BRIEF}/LONG_WAIT maybe?
Jens Axboe Aug. 18, 2024, 1:08 a.m. UTC | #8
On 8/17/24 4:04 PM, Pavel Begunkov wrote:
> On 8/17/24 22:09, Jens Axboe wrote:
>> On 8/17/24 3:05 PM, Pavel Begunkov wrote:
>>> On 8/17/24 21:20, Jens Axboe wrote:
>>>> On 8/17/24 1:44 PM, Pavel Begunkov wrote:
>>>>>> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
>>>>>> enter. If set, then current->in_iowait is not set. By default this flag
>>>>>
>>>>> A worthwhile change but for _completely_ different reasons. So, first,
>>>>> it's v3, not v2, considering the patchset from a couple month ago. And
>>>>> since in essence nothing has changed, I can only repeat same points I
>>>>> made back then.
>>>>>
>>>>> The description reads like the flag's purpose is to change accounting,
>>>>> and I'm vividly oppose any user exposed (per ring) toggle doing that.
>>>>> We don't want the overhead, it's a very confusing feature, and not even
>>>>
>>>> Come on, there's no overhead associated with this, in practice.
>>>
>>> Minor, right, ~ 1 "if" or ccmov, but that's for a feature that nobody
>>> really cares about and arguably even harmful. That's assuming there
>>> will be another flag for performance tuning in the future, at least
>>> the description reads like that.
>>
>> You literally brought up the case where people care yourself in the
>> reply, the one from Jamal. And the one I said is probably not the common
> 
> Not really. Yes, by chance high iowait helped to narrow it to the
> in_iowait io_uring patch, but that's rather a weird side effect,
> not what iowait was "created for", i.e. system is idling because
> storage waiting takes too long.

It's definitely a smoking gun, if you're already looking for that. But
the main question remains - was this an actual power issue or not?
Because the answer to that is pretty key imho.

> And that "use case" for iowait directly linked to cpufreq, so
> if it still counts, then we shouldn't be separating stats from
> cpufreq at all.

This is what the cpufreq people want to do anyway, so it'll probably
happen whether we like it or not.

>> case, yet I think we should cater to it as it very well could be legit,
>> just in the tiny minority of cases.
> 
> I explained why it's a confusing feature. We can make up some niche
> case (with enough of imagination we can justify basically anything),
> but I explained why IMHO accounting flag (let's forget about
> cpufreq) would have net negative effect. A sysctl knob would be
> much more reasonable, but I don't think it's needed at all.

The main thing for me is policy vs flexibility. The fact that boost and
iowait accounting is currently tied together is pretty ugly and will
hopefully go away with my patches.

>>>> It's really simple for this stuff - the freq boost is useful (and
>>>> needed) for some workloads, and the iowait accounting is never useful
>>>> for anything but (currently) comes as an unfortunate side effect of the
>>>> former. But even with those two separated, there are still going to be
>>>> cases where you want to control when it happens.
>>>
>>> You can imagine such cases, but in reality I doubt it. If we
>>> disable the stat part, nobody would notice as nobody cared for
>>> last 3-4 years before in_iowait was added.
>>
>> That would be ideal. You're saying Jamal's complaint was purely iowait
>> based? Because it looked like power concerns to me... If it's just
>> iowait, then they just need to stop looking at that, that's pretty
>> simple.
> 
> Power consumption, and then, in search of what's wrong, it was
> correlated to high iowait as well as difference in C state stats.

But this means that it was indeed power consumption, and iowait was just
the canary in the coal mine that lead them down the right path.

And this in turn means that even with the split, we want to
differentiate between short/busty sleeps and longer ones.

>>>>> that helpful. iowait is monitored not by the app itself but by someone
>>>>> else outside, likely by a different person, and even before trying to
>>>>> make sense of numbers the monitor would need to learn first whether
>>>>> _any_ program uses io_uring and what flags the application writer
>>>>> decided to pass, even more fun when io_uring is used via a 3rd party
>>>>> library.
>>>>>
>>>>> Exactly same patches could make sense if you flip the description
>>>>> and say "in_iowait is good for perfomance in some cases but
>>>>> degrades power consumption for others, so here is a way to tune
>>>>> performance", (just take Jamal's numbers). And that would need to
>>>>> clearly state (including man) that the iowait statistic is a side
>>>>> effect of it, we don't give it a thought, and the time accounting
>>>>> aspect may and hopefully will change in the future.
>>>>
>>>> Don't disagree with that, the boosting is the primary function here,
>>>> iowait accounting is just an odd relic and side effect.
>>>>
>>>>> Jens, can you remind what happened with separating iowait stats
>>>>> vs the optimisation? I believed you sent some patches
>>>>
>>>> Yep I still have them, and I'll dust them off and resend them. But it
>>>> doesn't really change the need for this at all, other than perhaps
>>>> wanting to rename the actual flag as it would not be about iowait at
>>>> all, it'd just be about power consumption.
>>>>
>>>> Last cut was here:
>>>>
>>>> https://git.kernel.dk/cgit/linux/log/?h=iowait.2
>>>>
>>>> just need simple rebasing.
>>>
>>> IMHO we should try to merge it first. And if there would be
>>> some friction, we can get back and take this patch, with
>>> additional note describing about iowait stat side effects.
>>
>> I did just rebase on top of sched/core and sent out a new version. I
>> still think it makes sense regardless of this patch or not, only
>> difference would be what you call the flag.
> 
> There shouldn't be any difference in how it's called or how
> it's recommended to be used, but maybe I don't understand
> what you mean. As I see it, the difference in a blob of
> text in the man mentioning the iowait stat side effect.

See above.

>> It's here:
>>
>> https://lore.kernel.org/lkml/20240817204639.132794-1-axboe@kernel.dk/T/#mab3638659a54193927351ebdc5e345278637bc41
> 
> Perfect
> 
> 
>>>>>> is not set to maintain existing behaviour i.e. in_iowait is always set.
>>>>>> This is to prevent waiting for completions being accounted as CPU
>>>>>> utilisation.
>>>>>
>>>>> For accounting, it's more reasonable to keep it disabled by
>>>>> default, so we stop getting millions complaints per day about
>>>>> high iowait.
>>>>
>>>> Agree, it should just go away.
>>>>
>>>>>> Not setting in_iowait does mean that we also lose cpufreq optimisations
>>>>>> above because in_iowait semantics couples 1 and 2 together. Eventually
>>>>>> we will untangle the two so the optimisations can be enabled
>>>>>> independently of the accounting.
>>>>>>
>>>>>> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
>>>>>> support. This will be used by liburing to check for this feature.
>>>>>> Signed-off-by: David Wei <dw@davidwei.uk>
>>>>>> ---
>>>>>> v2:
>>>>>>     - squash patches into one
>>>>>>     - move no_iowait in struct io_wait_queue to the end
>>>>>>     - always set iowq.no_iowait
>>>>>>
>>>>>> ---
>>>>>>     include/uapi/linux/io_uring.h | 2 ++
>>>>>>     io_uring/io_uring.c           | 7 ++++---
>>>>>>     io_uring/io_uring.h           | 1 +
>>>>>>     3 files changed, 7 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>>>> index 48c440edf674..3a94afa8665e 100644
>>>>>> --- a/include/uapi/linux/io_uring.h
>>>>>> +++ b/include/uapi/linux/io_uring.h
>>>>>> @@ -508,6 +508,7 @@ struct io_cqring_offsets {
>>>>>>     #define IORING_ENTER_EXT_ARG        (1U << 3)
>>>>>>     #define IORING_ENTER_REGISTERED_RING    (1U << 4)
>>>>>>     #define IORING_ENTER_ABS_TIMER        (1U << 5)
>>>>>> +#define IORING_ENTER_NO_IOWAIT        (1U << 6)
>>>>>
>>>>> Just curious, why did we switch from a register opcode to an
>>>>> ENTER flag?
>>>>
>>>> That was my suggestion, I think it's more flexible in that you can
>>>> disable the boost if you know you're doing longer waits or slower IO,
>>>> and enable it when frequencies go up. I don't particularly like the
>>>> static register approach for this.
>>>
>>> Makes sense, it's easier for the app to count how many
>>> request of what type is inflight.
>>
>> Exactly
>>
>>> The name might also be confusing. We need an explanation when
>>> it could be useful, and name it accordingly. DEEP/SHALLOW_WAIT?
>>> Do you remember how cpufreq accounts for it?
>>
>> I don't remember how it accounts for it, and was just pondering that
>> with the above reply. Because if it just decays the sleep state, then
>> you could just use it generically. If it stays high regardless of how
>> long you wait, then it could be a power issue. Not on servers really
>> (well a bit, depending on boosting), but more so on desktop apps.
>> Laptops tend to be pretty power conservative!
> 
> {SHORT,BRIEF}/LONG_WAIT maybe?

I think that's a lot more descriptive. Ideally we'd want to tie this to
wakeup latencies, eg we'd need to know about wakeup latencies. For
example, if the user asks for a 100 usec wait, we'd want to influence
what sleep state is picked in propagating that information. Things like
the min-wait I posted would directly work for that, as it tells the
story in two chapters on what waits we're expecting here. Currently
there's no way to do that (hence iowait -> cpufreq boosting), but there
clearly should be. Or even without min-wait, the timeout is clearly
known here, combined with the expected/desired number of events the
application is looking for.
Pavel Begunkov Aug. 18, 2024, 2:27 a.m. UTC | #9
On 8/18/24 02:08, Jens Axboe wrote:
> On 8/17/24 4:04 PM, Pavel Begunkov wrote:
>> On 8/17/24 22:09, Jens Axboe wrote:
>>> On 8/17/24 3:05 PM, Pavel Begunkov wrote:
>>>> On 8/17/24 21:20, Jens Axboe wrote:
>>>>> On 8/17/24 1:44 PM, Pavel Begunkov wrote:
>>>>>>> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
>>>>>>> enter. If set, then current->in_iowait is not set. By default this flag
...
>> And that "use case" for iowait directly linked to cpufreq, so
>> if it still counts, then we shouldn't be separating stats from
>> cpufreq at all.
> 
> This is what the cpufreq people want to do anyway, so it'll probably
> happen whether we like it or not.

Not against it, quite the opposite


>>> case, yet I think we should cater to it as it very well could be legit,
>>> just in the tiny minority of cases.
>>
>> I explained why it's a confusing feature. We can make up some niche
>> case (with enough of imagination we can justify basically anything),
>> but I explained why IMHO accounting flag (let's forget about
>> cpufreq) would have net negative effect. A sysctl knob would be
>> much more reasonable, but I don't think it's needed at all.
> 
> The main thing for me is policy vs flexibility. The fact that boost and
> iowait accounting is currently tied together is pretty ugly and will
> hopefully go away with my patches.
> 
>>>>> It's really simple for this stuff - the freq boost is useful (and
>>>>> needed) for some workloads, and the iowait accounting is never useful
>>>>> for anything but (currently) comes as an unfortunate side effect of the
>>>>> former. But even with those two separated, there are still going to be
>>>>> cases where you want to control when it happens.
>>>>
>>>> You can imagine such cases, but in reality I doubt it. If we
>>>> disable the stat part, nobody would notice as nobody cared for
>>>> last 3-4 years before in_iowait was added.
>>>
>>> That would be ideal. You're saying Jamal's complaint was purely iowait
>>> based? Because it looked like power concerns to me... If it's just
>>> iowait, then they just need to stop looking at that, that's pretty
>>> simple.
>>
>> Power consumption, and then, in search of what's wrong, it was
>> correlated to high iowait as well as difference in C state stats.
> 
> But this means that it was indeed power consumption, and iowait was just
> the canary in the coal mine that lead them down the right path.
> 
> And this in turn means that even with the split, we want to
> differentiate between short/busty sleeps and longer ones.

That's what I've been talking about since a couple of months ago,
for networking we have a well measured energy consumption
regression because of iowait, not like we can just leave it as
it is now. And For the lack of a good way to auto tune in the
kernel, an enter flag (described as a performance feature) looks
good, I agree.

...
>>>
>>>> The name might also be confusing. We need an explanation when
>>>> it could be useful, and name it accordingly. DEEP/SHALLOW_WAIT?
>>>> Do you remember how cpufreq accounts for it?
>>>
>>> I don't remember how it accounts for it, and was just pondering that
>>> with the above reply. Because if it just decays the sleep state, then
>>> you could just use it generically. If it stays high regardless of how
>>> long you wait, then it could be a power issue. Not on servers really
>>> (well a bit, depending on boosting), but more so on desktop apps.
>>> Laptops tend to be pretty power conservative!
>>
>> {SHORT,BRIEF}/LONG_WAIT maybe?
> 
> I think that's a lot more descriptive. Ideally we'd want to tie this to
> wakeup latencies, eg we'd need to know about wakeup latencies. For
> example, if the user asks for a 100 usec wait, we'd want to influence
> what sleep state is picked in propagating that information. Things like
> the min-wait I posted would directly work for that, as it tells the
> story in two chapters on what waits we're expecting here. Currently
> there's no way to do that (hence iowait -> cpufreq boosting), but there
> clearly should be. Or even without min-wait, the timeout is clearly
> known here, combined with the expected/desired number of events the
> application is looking for.

Yeah, interesting, we can auto apply it depending on the delta
time, etc. Might worth to ask the cpufreq guys about thresholds.
David Wei Aug. 19, 2024, 11:03 p.m. UTC | #10
On 2024-08-16 18:23, Jeff Moyer wrote:
> Hi, David,
> 
> David Wei <dw@davidwei.uk> writes:
> 
>> io_uring sets current->in_iowait when waiting for completions, which
>> achieves two things:
>>
>> 1. Proper accounting of the time as iowait time
>> 2. Enable cpufreq optimisations, setting SCHED_CPUFREQ_IOWAIT on the rq
>>
>> For block IO this makes sense as high iowait can be indicative of
>> issues.
> 
> It also let's you know that the system isn't truly idle.  IOW, it would
> be doing some work if it didn't have to wait for I/O.  This was the
> reason the metric was added (admins being confused about why their
> system was showing up idle).

I see. Thanks for the historical context.

> 
>> But for network IO especially recv, the recv side does not control
>> when the completions happen.
>>
>> Some user tooling attributes iowait time as CPU utilisation i.e. not
> 
> What user tooling are you talking about?  If it shows iowait as busy
> time, the tooling is broken.  Please see my last mail on the subject:
>   https://lore.kernel.org/io-uring/x49cz0hxdfa.fsf@segfault.boston.devel.redhat.com/

Our internal tooling for example considers CPU util% to be (100 -
idle%), but it also has a CPU busy% defined as (100 - idle% - iowait%).
It is very unfortunate that everyone uses CPU util% for monitoring, with
all sorts of alerts, dashboards and load balancers referring to this
value. One reason is that, depending on context, high iowait time may or
may not be a problem, so it isn't as simple as redefining CPU util% to
exclude iowait.

> 
>> idle, so high iowait time looks like high CPU util even though the task
>> is not scheduled and the CPU is free to run other tasks. When doing
>> network IO with e.g. the batch completion feature, the CPU may appear to
>> have high utilisation.
> 
> Again, iowait is idle time.

That's fair. I think it is simpler to have a single "CPU util" metric
defined as (100 - idle%), and have a switch that userspace explicitly
flips to say "I want iowait to be considered truly idle or not". This
means things such as load balancers can be built around a single metric,
rather than having to consider both util/busy and needing to understand
"does iowait mean anything?".

> 
>> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
>> enter. If set, then current->in_iowait is not set. By default this flag
>> is not set to maintain existing behaviour i.e. in_iowait is always set.
>> This is to prevent waiting for completions being accounted as CPU
>> utilisation.
>>
>> Not setting in_iowait does mean that we also lose cpufreq optimisations
>> above because in_iowait semantics couples 1 and 2 together. Eventually
>> we will untangle the two so the optimisations can be enabled
>> independently of the accounting.
>>
>> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
>> support. This will be used by liburing to check for this feature.
> 
> If I receive a problem report where iowait time isn't accurate, I now
> have to somehow figure out if an application is setting this flag.  This
> sounds like a support headache, and I do wonder what the benefit is.
> From what you've written, the justification for the patch is that some
> userspace tooling misinterprets iowait.  Shouldn't we just fix that?

Right, I understand your concerns. That's why by default this flag is
not set and io_uring behaves as before with in_iowait always set.

Unfortunately, "just fix userspace" for us is a huge ask because a whole
pyramid of both code and human understanding has been built on the
current definition of "CPU utilisation". This is extremely time
consuming to change, nor is it something that we (io_uring) want to take
on imo. Why not give the option for people to indicate whether they want
iowait showing up or not?

> 
> It may be that certain (all?) network functions, like recv, should not
> be accounted as iowait.  However, I don't think the onus should be on
> applications to tell the kernel about that--the kernel should just
> figure that out on its own.
> 
> Am I alone in these opinions?

Why should the onus be on the kernel? I think it is more difficult for
the kernel to figure out exactly what semantics userspace wants and it
is simpler for userspace to select their preference.

From my experience, userspace apps either assigns a thread with an
io_uring instance to network IO or disk IO, but never both. If there is
a valid case for doing both types in the same io_uring, then it would be
trivial to add wait helpers that sets IORING_ENTER_NO_IOWAIT on a per
wait basis.

I do agree with you that this is not ideal. What io_uring really wants
is to decouple the iowait accounting from the cpufreq optimisation that
gets enabled in the presence of in_iowait, which is a bigger ask and out
of scope of this patch. When _someone_ decides to fix the wider iowait
issue, I'm happy to revisit this patch.

> 
> Cheers,
> Jeff
>
diff mbox series

Patch

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 48c440edf674..3a94afa8665e 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -508,6 +508,7 @@  struct io_cqring_offsets {
 #define IORING_ENTER_EXT_ARG		(1U << 3)
 #define IORING_ENTER_REGISTERED_RING	(1U << 4)
 #define IORING_ENTER_ABS_TIMER		(1U << 5)
+#define IORING_ENTER_NO_IOWAIT		(1U << 6)
 
 /*
  * Passed in for io_uring_setup(2). Copied back with updated info on success
@@ -543,6 +544,7 @@  struct io_uring_params {
 #define IORING_FEAT_LINKED_FILE		(1U << 12)
 #define IORING_FEAT_REG_REG_RING	(1U << 13)
 #define IORING_FEAT_RECVSEND_BUNDLE	(1U << 14)
+#define IORING_FEAT_IOWAIT_TOGGLE	(1U << 15)
 
 /*
  * io_uring_register(2) opcodes and arguments
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 20229e72b65c..5e75672525df 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2372,7 +2372,7 @@  static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 	 * can take into account that the task is waiting for IO - turns out
 	 * to be important for low QD IO.
 	 */
-	if (current_pending_io())
+	if (!iowq->no_iowait && current_pending_io())
 		current->in_iowait = 1;
 	ret = 0;
 	if (iowq->timeout == KTIME_MAX)
@@ -2414,6 +2414,7 @@  static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
 	iowq.cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
 	iowq.timeout = KTIME_MAX;
+	iowq.no_iowait = flags & IORING_ENTER_NO_IOWAIT;
 
 	if (uts) {
 		struct timespec64 ts;
@@ -3155,7 +3156,7 @@  SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP |
 			       IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG |
 			       IORING_ENTER_REGISTERED_RING |
-			       IORING_ENTER_ABS_TIMER)))
+			       IORING_ENTER_ABS_TIMER | IORING_ENTER_NO_IOWAIT)))
 		return -EINVAL;
 
 	/*
@@ -3539,7 +3540,7 @@  static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 			IORING_FEAT_EXT_ARG | IORING_FEAT_NATIVE_WORKERS |
 			IORING_FEAT_RSRC_TAGS | IORING_FEAT_CQE_SKIP |
 			IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING |
-			IORING_FEAT_RECVSEND_BUNDLE;
+			IORING_FEAT_RECVSEND_BUNDLE | IORING_FEAT_IOWAIT_TOGGLE;
 
 	if (copy_to_user(params, p, sizeof(*p))) {
 		ret = -EFAULT;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 9935819f12b7..426079a966ac 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -46,6 +46,7 @@  struct io_wait_queue {
 	ktime_t napi_busy_poll_dt;
 	bool napi_prefer_busy_poll;
 #endif
+	bool no_iowait;
 };
 
 static inline bool io_should_wake(struct io_wait_queue *iowq)