Message ID | f548f142-d6f3-46d8-9c58-6cf595c968fb@kernel.dk (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] io_uring: enable toggle of iowait usage when waiting on CQEs | expand |
On 3/14/25 18:48, Jens Axboe wrote: > By default, io_uring marks a waiting task as being in iowait, if it's > sleeping waiting on events and there are pending requests. This isn't > necessarily always useful, and may be confusing on non-storage setups > where iowait isn't expected. It can also cause extra power usage, by I think this passage hints on controlling iowait stats, and in my opinion we shouldn't conflate stats and optimisations. Global iowait stats is there to stay, but ideally we want to never account io_uring as iowait. That's while there were talks about removing optimisation toggle at all (and do it as internal cpufreq magic, I suppose). How about posing it as an optimisation option only and that iowait stat is a side effect that can change. Explicitly spelling that in the commit message and in a comment on top of the flag in an attempt to avoid the uapi regression trap. We'd also need it in the option's man when it's written. And I'd also add "hint" to the flag name, like IORING_ENTER_HINT_NO_IOWAIT, as we might need to nop it if anything changes on the cpufreq side.
On 3/16/25 12:57 AM, Pavel Begunkov wrote: > On 3/14/25 18:48, Jens Axboe wrote: >> By default, io_uring marks a waiting task as being in iowait, if it's >> sleeping waiting on events and there are pending requests. This isn't >> necessarily always useful, and may be confusing on non-storage setups >> where iowait isn't expected. It can also cause extra power usage, by > > I think this passage hints on controlling iowait stats, and in my opinion > we shouldn't conflate stats and optimisations. Global iowait stats > is there to stay, but ideally we want to never account io_uring as iowait. > That's while there were talks about removing optimisation toggle at all > (and do it as internal cpufreq magic, I suppose). > > How about posing it as an optimisation option only and that iowait stat > is a side effect that can change. Explicitly spelling that in the commit > message and in a comment on top of the flag in an attempt to avoid the > uapi regression trap. We'd also need it in the option's man when it's > written. And I'd also add "hint" to the flag name, like > IORING_ENTER_HINT_NO_IOWAIT, as we might need to nop it if anything > changes on the cpufreq side. Having potentially the control of both would be useful, the stat accounting and the cpufreq boosting. I do think the current name is better, though, the hint doesn't really add anything. I think we'd want to have it be clear on one controlling accounting only. Maybe adding both flagts now would be the better choice, except you'd get -EINVAL if you set IORING_ENTER_IOWAIT_BOOST. And then you'd need two FEAT flags, which is pretty damn wasteful. Hmm...
On 3/17/25 14:07, Jens Axboe wrote: > On 3/16/25 12:57 AM, Pavel Begunkov wrote: >> On 3/14/25 18:48, Jens Axboe wrote: >>> By default, io_uring marks a waiting task as being in iowait, if it's >>> sleeping waiting on events and there are pending requests. This isn't >>> necessarily always useful, and may be confusing on non-storage setups >>> where iowait isn't expected. It can also cause extra power usage, by >> >> I think this passage hints on controlling iowait stats, and in my opinion >> we shouldn't conflate stats and optimisations. Global iowait stats >> is there to stay, but ideally we want to never account io_uring as iowait. >> That's while there were talks about removing optimisation toggle at all >> (and do it as internal cpufreq magic, I suppose). >> >> How about posing it as an optimisation option only and that iowait stat >> is a side effect that can change. Explicitly spelling that in the commit >> message and in a comment on top of the flag in an attempt to avoid the >> uapi regression trap. We'd also need it in the option's man when it's >> written. And I'd also add "hint" to the flag name, like >> IORING_ENTER_HINT_NO_IOWAIT, as we might need to nop it if anything >> changes on the cpufreq side. > > Having potentially the control of both would be useful, the stat It's not the right place to control the stat accounting though, apps don't care about iowait, it's usually monitored by a different entity / person from outside the app, so responsibilities don't match. It's fine if you fully control the stack, but just imagine a bunch of apps using different frameworks with io_uring inside that make different choices about it. The final iowait reading would be just a mess. With this patch at least we can say it's an unfortunate side effect. If we can separately control the accounting, a sysctl knob would probably be better, i.e. to be set globally from outside of an app, but I don't think we care enough to add extra logic / overhead for handling it. > accounting and the cpufreq boosting. I do think the current name is > better, though, the hint doesn't really add anything. I think we'd want "Hint" tells the user that it's legit for the kernel to ignore it, including the iowait stat differences the user may see. And we may actually need to drop the flag if task->iowait knob will get hidden from io_uring in the future. The main benefit here is for it to be in the name, because there are always those who don't read comments. > to have it be clear on one controlling accounting only. Maybe adding > both flagts now would be the better choice, except you'd get -EINVAL if > you set IORING_ENTER_IOWAIT_BOOST. And then you'd need two FEAT flags, > which is pretty damn wasteful. > > Hmm... >
On 3/18/25 12:39 AM, Pavel Begunkov wrote: > On 3/17/25 14:07, Jens Axboe wrote: >> On 3/16/25 12:57 AM, Pavel Begunkov wrote: >>> On 3/14/25 18:48, Jens Axboe wrote: >>>> By default, io_uring marks a waiting task as being in iowait, if it's >>>> sleeping waiting on events and there are pending requests. This isn't >>>> necessarily always useful, and may be confusing on non-storage setups >>>> where iowait isn't expected. It can also cause extra power usage, by >>> >>> I think this passage hints on controlling iowait stats, and in my opinion >>> we shouldn't conflate stats and optimisations. Global iowait stats >>> is there to stay, but ideally we want to never account io_uring as iowait. >>> That's while there were talks about removing optimisation toggle at all >>> (and do it as internal cpufreq magic, I suppose). >>> >>> How about posing it as an optimisation option only and that iowait stat >>> is a side effect that can change. Explicitly spelling that in the commit >>> message and in a comment on top of the flag in an attempt to avoid the >>> uapi regression trap. We'd also need it in the option's man when it's >>> written. And I'd also add "hint" to the flag name, like >>> IORING_ENTER_HINT_NO_IOWAIT, as we might need to nop it if anything >>> changes on the cpufreq side. >> >> Having potentially the control of both would be useful, the stat > > It's not the right place to control the stat accounting though, > apps don't care about iowait, it's usually monitored by a different > entity / person from outside the app, so responsibilities don't > match. It's fine if you fully control the stack, but just imagine Sometimes those are one and the same thing, though - there's just the one application running. That's not uncommon in data centers. > a bunch of apps using different frameworks with io_uring inside > that make different choices about it. The final iowait reading > would be just a mess. With this patch at least we can say it's > an unfortunate side effect. > If we can separately control the accounting, a sysctl knob would > probably be better, i.e. to be set globally from outside of an > app, but I don't think we care enough to add extra logic / overhead > for handling it. That's not a bad idea, maybe we just do that for starters? We can always introduce per-enter flags for managing boost and/or stats, at least it provides a system wide setting that can just get overridden by flags, should we need it. >> accounting and the cpufreq boosting. I do think the current name is >> better, though, the hint doesn't really add anything. I think we'd want > > "Hint" tells the user that it's legit for the kernel to ignore > it, including the iowait stat differences the user may see. And > we may actually need to drop the flag if task->iowait knob will > get hidden from io_uring in the future. The main benefit here > is for it to be in the name, because there are always those who > don't read comments. But that's the part I have a problem with - sometimes you'd need to know if it's honored or not.
On 3/18/25 18:36, Jens Axboe wrote: > On 3/18/25 12:39 AM, Pavel Begunkov wrote: >> On 3/17/25 14:07, Jens Axboe wrote: >>> On 3/16/25 12:57 AM, Pavel Begunkov wrote: >>>> On 3/14/25 18:48, Jens Axboe wrote: >>>>> By default, io_uring marks a waiting task as being in iowait, if it's >>>>> sleeping waiting on events and there are pending requests. This isn't >>>>> necessarily always useful, and may be confusing on non-storage setups >>>>> where iowait isn't expected. It can also cause extra power usage, by >>>> >>>> I think this passage hints on controlling iowait stats, and in my opinion >>>> we shouldn't conflate stats and optimisations. Global iowait stats >>>> is there to stay, but ideally we want to never account io_uring as iowait. >>>> That's while there were talks about removing optimisation toggle at all >>>> (and do it as internal cpufreq magic, I suppose). >>>> >>>> How about posing it as an optimisation option only and that iowait stat >>>> is a side effect that can change. Explicitly spelling that in the commit >>>> message and in a comment on top of the flag in an attempt to avoid the >>>> uapi regression trap. We'd also need it in the option's man when it's >>>> written. And I'd also add "hint" to the flag name, like >>>> IORING_ENTER_HINT_NO_IOWAIT, as we might need to nop it if anything >>>> changes on the cpufreq side. >>> >>> Having potentially the control of both would be useful, the stat >> >> It's not the right place to control the stat accounting though, >> apps don't care about iowait, it's usually monitored by a different >> entity / person from outside the app, so responsibilities don't >> match. It's fine if you fully control the stack, but just imagine > > Sometimes those are one and the same thing, though - there's just the > one application running. That's not uncommon in data centers. Yep, but that's only a subset, and for others the very fact of the feature existence creates a mess, which might be fine or not. >> a bunch of apps using different frameworks with io_uring inside >> that make different choices about it. The final iowait reading >> would be just a mess. With this patch at least we can say it's >> an unfortunate side effect. >> If we can separately control the accounting, a sysctl knob would >> probably be better, i.e. to be set globally from outside of an >> app, but I don't think we care enough to add extra logic / overhead >> for handling it. > > That's not a bad idea, maybe we just do that for starters? We can always Do we really want it though? What are you trying to achieve, fixing the iowait stat problem or providing an optimisation option? Because as I see it, what's good for one is bad for the other, unfortunately. A sysctl is not a great option as an optimisation, because with that all apps in the system has either to be storage or net to be optimal in relation to iowait / power consumption. That one you won't even be able to use in a good number of server setups while getting optimal power consumption, even if you own the entire stack. It sounds to me like the best option is to choose which one we want to solve at the moment. Global / sysctl option for the stat, but I'm not sure it's that important atm, people complain less nowadays as well. Enter flag goes fine for the iowait optimisation, but messes with the stat. IMHO, that should be fine if we're clear about it and that the stat part of it can change. That's what I'd suggest doing. The third option is to try to solve them both, but seems your patches got buried in a discussion, and working it around at io_uring side doesn't sound pretty, like two flags you mentioned. Another option is to just have v2 and tell that the optimisation and the accounting is the same, having some mess on the stat side, and deal with the consequences when the in-kernel semantics changes. > introduce per-enter flags for managing boost and/or stats, at least it > provides a system wide setting that can just get overridden by flags, > should we need it. > >>> accounting and the cpufreq boosting. I do think the current name is >>> better, though, the hint doesn't really add anything. I think we'd want >> >> "Hint" tells the user that it's legit for the kernel to ignore >> it, including the iowait stat differences the user may see. And >> we may actually need to drop the flag if task->iowait knob will >> get hidden from io_uring in the future. The main benefit here >> is for it to be in the name, because there are always those who >> don't read comments. > > But that's the part I have a problem with - sometimes you'd need to know > if it's honored or not. If the flag implements optimisation only part of iowait, I don't think it's so necessary, but we can add some slow path for querying if it has the iowait stat "side effect". If it's about the stat, yeah, probably we can't just ignore it and "hint" is not a good idea.
On 3/18/25 2:07 PM, Pavel Begunkov wrote: > On 3/18/25 18:36, Jens Axboe wrote: >> On 3/18/25 12:39 AM, Pavel Begunkov wrote: >>> On 3/17/25 14:07, Jens Axboe wrote: >>>> On 3/16/25 12:57 AM, Pavel Begunkov wrote: >>>>> On 3/14/25 18:48, Jens Axboe wrote: >>>>>> By default, io_uring marks a waiting task as being in iowait, if it's >>>>>> sleeping waiting on events and there are pending requests. This isn't >>>>>> necessarily always useful, and may be confusing on non-storage setups >>>>>> where iowait isn't expected. It can also cause extra power usage, by >>>>> >>>>> I think this passage hints on controlling iowait stats, and in my opinion >>>>> we shouldn't conflate stats and optimisations. Global iowait stats >>>>> is there to stay, but ideally we want to never account io_uring as iowait. >>>>> That's while there were talks about removing optimisation toggle at all >>>>> (and do it as internal cpufreq magic, I suppose). >>>>> >>>>> How about posing it as an optimisation option only and that iowait stat >>>>> is a side effect that can change. Explicitly spelling that in the commit >>>>> message and in a comment on top of the flag in an attempt to avoid the >>>>> uapi regression trap. We'd also need it in the option's man when it's >>>>> written. And I'd also add "hint" to the flag name, like >>>>> IORING_ENTER_HINT_NO_IOWAIT, as we might need to nop it if anything >>>>> changes on the cpufreq side. >>>> >>>> Having potentially the control of both would be useful, the stat >>> >>> It's not the right place to control the stat accounting though, >>> apps don't care about iowait, it's usually monitored by a different >>> entity / person from outside the app, so responsibilities don't >>> match. It's fine if you fully control the stack, but just imagine >> >> Sometimes those are one and the same thing, though - there's just the >> one application running. That's not uncommon in data centers. > > Yep, but that's only a subset, and for others the very fact of the > feature existence creates a mess, which might be fine or not. But at least with an opt-out flag, if you don't care, you never need to worry about it. >>> a bunch of apps using different frameworks with io_uring inside >>> that make different choices about it. The final iowait reading >>> would be just a mess. With this patch at least we can say it's >>> an unfortunate side effect. >>> If we can separately control the accounting, a sysctl knob would >>> probably be better, i.e. to be set globally from outside of an >>> app, but I don't think we care enough to add extra logic / overhead >>> for handling it. >> >> That's not a bad idea, maybe we just do that for starters? We can always > > Do we really want it though? What are you trying to achieve, fixing > the iowait stat problem or providing an optimisation option? Because > as I see it, what's good for one is bad for the other, unfortunately. > A sysctl is not a great option as an optimisation, because with that > all apps in the system has either to be storage or net to be optimal > in relation to iowait / power consumption. That one you won't even > be able to use in a good number of server setups while getting > optimal power consumption, even if you own the entire stack. > > It sounds to me like the best option is to choose which one we want > to solve at the moment. Global / sysctl option for the stat, but I'm > not sure it's that important atm, people complain less nowadays > as well. Enter flag goes fine for the iowait optimisation, but > messes with the stat. IMHO, that should be fine if we're clear > about it and that the stat part of it can change. That's what > I'd suggest doing. > > The third option is to try to solve them both, but seems your > patches got buried in a discussion, and working it around at > io_uring side doesn't sound pretty, like two flags you > mentioned. I'm not digging into that again. Once those guys figure out what they want, we can address it on our side. > Another option is to just have v2 and tell that the optimisation > and the accounting is the same, having some mess on the stat > side, and deal with the consequences when the in-kernel semantics > changes. After thinking about it a bit more, I do think v2 is the best approach. And the name is probably fine, IORING_ENTER_NO_IOWAIT. If we at some point end up having the ability to control boost and stats separately, we could add IORING_ENTER_IOWAIT_BOOST or something. That'll allow you to control both separately. What do you think? I do want to get this sorted for 6.15, I feel like we've been ignoring or stalling on this issue for way too long.
On 3/19/25 01:54, Jens Axboe wrote: ... >> Do we really want it though? What are you trying to achieve, fixing >> the iowait stat problem or providing an optimisation option? Because >> as I see it, what's good for one is bad for the other, unfortunately. >> A sysctl is not a great option as an optimisation, because with that >> all apps in the system has either to be storage or net to be optimal >> in relation to iowait / power consumption. That one you won't even >> be able to use in a good number of server setups while getting >> optimal power consumption, even if you own the entire stack. >> >> It sounds to me like the best option is to choose which one we want >> to solve at the moment. Global / sysctl option for the stat, but I'm >> not sure it's that important atm, people complain less nowadays >> as well. Enter flag goes fine for the iowait optimisation, but >> messes with the stat. IMHO, that should be fine if we're clear >> about it and that the stat part of it can change. That's what >> I'd suggest doing. >> >> The third option is to try to solve them both, but seems your >> patches got buried in a discussion, and working it around at >> io_uring side doesn't sound pretty, like two flags you >> mentioned. > > I'm not digging into that again. Once those guys figure out what they > want, we can address it on our side. > >> Another option is to just have v2 and tell that the optimisation >> and the accounting is the same, having some mess on the stat >> side, and deal with the consequences when the in-kernel semantics >> changes. > > After thinking about it a bit more, I do think v2 is the best approach. > And the name is probably fine, IORING_ENTER_NO_IOWAIT. If we at some > point end up having the ability to control boost and stats separately, > we could add IORING_ENTER_IOWAIT_BOOST or something. That'll allow you > to control both separately. After a private discussion, apparently you do care about the iowait stat, and there are just too many users that complain about it and tools that misinterpret it. All that "hint" and "side effect" discussion of mine doesn't solve that. I think we do agree that it's a poor interface, but there is no good alternative, not on the io_uring side, and it's pretty sad that we're forced into a bad design because the patches actually solving the problem are blocked for a year now by sched/cpufreq. > What do you think? I do want to get this sorted for 6.15, I feel like > we've been ignoring or stalling on this issue for way too long. >
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 050fa8eb2e8f..0d6c83c8d1cf 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -541,6 +541,7 @@ struct io_cqring_offsets { #define IORING_ENTER_REGISTERED_RING (1U << 4) #define IORING_ENTER_ABS_TIMER (1U << 5) #define IORING_ENTER_EXT_ARG_REG (1U << 6) +#define IORING_ENTER_NO_IOWAIT (1U << 7) /* * Passed in for io_uring_setup(2). Copied back with updated info on success @@ -578,6 +579,7 @@ struct io_uring_params { #define IORING_FEAT_RECVSEND_BUNDLE (1U << 14) #define IORING_FEAT_MIN_TIMEOUT (1U << 15) #define IORING_FEAT_RW_ATTR (1U << 16) +#define IORING_FEAT_NO_IOWAIT (1U << 17) /* * io_uring_register(2) opcodes and arguments diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 58003fa6b327..d975e68e91f2 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -2485,8 +2485,18 @@ static int io_cqring_schedule_timeout(struct io_wait_queue *iowq, return READ_ONCE(iowq->hit_timeout) ? -ETIME : 0; } +struct ext_arg { + size_t argsz; + struct timespec64 ts; + const sigset_t __user *sig; + ktime_t min_time; + bool ts_set; + bool iowait; +}; + static int __io_cqring_wait_schedule(struct io_ring_ctx *ctx, struct io_wait_queue *iowq, + struct ext_arg *ext_arg, ktime_t start_time) { int ret = 0; @@ -2496,7 +2506,7 @@ static int __io_cqring_wait_schedule(struct io_ring_ctx *ctx, * can take into account that the task is waiting for IO - turns out * to be important for low QD IO. */ - if (current_pending_io()) + if (ext_arg->iowait && current_pending_io()) current->in_iowait = 1; if (iowq->timeout != KTIME_MAX || iowq->min_timeout) ret = io_cqring_schedule_timeout(iowq, ctx->clockid, start_time); @@ -2509,6 +2519,7 @@ static int __io_cqring_wait_schedule(struct io_ring_ctx *ctx, /* If this returns > 0, the caller should retry */ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx, struct io_wait_queue *iowq, + struct ext_arg *ext_arg, ktime_t start_time) { if (unlikely(READ_ONCE(ctx->check_cq))) @@ -2522,17 +2533,9 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx, if (unlikely(io_should_wake(iowq))) return 0; - return __io_cqring_wait_schedule(ctx, iowq, start_time); + return __io_cqring_wait_schedule(ctx, iowq, ext_arg, start_time); } -struct ext_arg { - size_t argsz; - struct timespec64 ts; - const sigset_t __user *sig; - ktime_t min_time; - bool ts_set; -}; - /* * Wait until events become available, if we don't already have some. The * application must reap them itself, as they reside on the shared cq ring. @@ -2610,7 +2613,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags, TASK_INTERRUPTIBLE); } - ret = io_cqring_wait_schedule(ctx, &iowq, start_time); + ret = io_cqring_wait_schedule(ctx, &iowq, ext_arg, start_time); __set_current_state(TASK_RUNNING); atomic_set(&ctx->cq_wait_nr, IO_CQ_WAKE_INIT); @@ -3261,6 +3264,8 @@ static int io_get_ext_arg(struct io_ring_ctx *ctx, unsigned flags, const struct io_uring_getevents_arg __user *uarg = argp; struct io_uring_getevents_arg arg; + ext_arg->iowait = !(flags & IORING_ENTER_NO_IOWAIT); + /* * If EXT_ARG isn't set, then we have no timespec and the argp pointer * is just a pointer to the sigset_t. @@ -3338,7 +3343,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG | IORING_ENTER_REGISTERED_RING | IORING_ENTER_ABS_TIMER | - IORING_ENTER_EXT_ARG_REG))) + IORING_ENTER_EXT_ARG_REG | + IORING_ENTER_NO_IOWAIT))) return -EINVAL; /* @@ -3752,7 +3758,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, IORING_FEAT_RSRC_TAGS | IORING_FEAT_CQE_SKIP | IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING | IORING_FEAT_RECVSEND_BUNDLE | IORING_FEAT_MIN_TIMEOUT | - IORING_FEAT_RW_ATTR; + IORING_FEAT_RW_ATTR | IORING_FEAT_NO_IOWAIT; if (copy_to_user(params, p, sizeof(*p))) { ret = -EFAULT;
By default, io_uring marks a waiting task as being in iowait, if it's sleeping waiting on events and there are pending requests. This isn't necessarily always useful, and may be confusing on non-storage setups where iowait isn't expected. It can also cause extra power usage, by preventing the CPU from entering lower sleep states. This adds a new enter flag, IORING_ENTER_NO_IOWAIT. If set, then io_uring will not mark the sleeping task as being in iowait. If the kernel support this feature, then it will be marked by having the IORING_FEAT_NO_IOWAIT feature flag set. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- Basic liburing support and a test case here: https://git.kernel.dk/cgit/liburing/log/?h=iowait Since v1: - Add IORING_ENTER_NO_IOWAIT feature flag