Message ID: cover.1569139018.git.asml.silence@gmail.com (mailing list archive)
Series: Optimise io_uring completion waiting
On 9/22/19 2:08 AM, Pavel Begunkov (Silence) wrote: > From: Pavel Begunkov <asml.silence@gmail.com> > > There could be a lot of overhead within generic wait_event_*() used for > waiting for large number of completions. The patchset removes much of > it by using custom wait event (wait_threshold). > > Synthetic test showed ~40% performance boost. (see patch 2) I'm fine with the io_uring side of things, but to queue this up we really need Peter or Ingo to sign off on the core wakeup bits... Peter?
* Jens Axboe <axboe@kernel.dk> wrote: > On 9/22/19 2:08 AM, Pavel Begunkov (Silence) wrote: > > From: Pavel Begunkov <asml.silence@gmail.com> > > > > There could be a lot of overhead within generic wait_event_*() used for > > waiting for large number of completions. The patchset removes much of > > it by using custom wait event (wait_threshold). > > > > Synthetic test showed ~40% performance boost. (see patch 2) > > I'm fine with the io_uring side of things, but to queue this up we > really need Peter or Ingo to sign off on the core wakeup bits... > > Peter? I'm not sure an extension is needed for such a special interface, why not just put a ->threshold value next to the ctx->wait field and use either the regular wait_event() APIs with the proper condition, or wait_event_cmd() style APIs if you absolutely need something more complex to happen inside? Should result in a much lower linecount and no scheduler changes. :-) Thanks, Ingo
Hi, and thanks for the feedback. It could be done with @cond indeed, that's how it works for now. However, this patchset addresses performance issues only. The problem with wait_event_*() is that, if we have a counter and are trying to wake up tasks after each increment, each waiting task gets scheduled O(threshold) times just to spuriously check @cond and go back to sleep. All that overhead (memory barriers, register save/restore, accounting, etc.) turned out to be enough to slow down the system for some workloads. With this specialisation it still traverses a wait list and makes indirect calls to the checker callback, but the list is supposedly fairly small, so performance there shouldn't be a problem, at least for now.

Regarding semantics: it should wake a task when the value passed to wake_up_threshold() is greater than or equal to that task's threshold, which is specified individually for each task in wait_threshold_*(). In pseudo code:

```
def wake_up_threshold(n, wait_queue):
    for waiter in wait_queue:
        waiter.wake_up_if(n >= waiter.threshold);
```

Any thoughts on how to do it better? Ideas are very welcome. BTW, this monster is mostly a copy-paste from wait_event_*() and wait_bit_*(). We could try to extract some common parts from these three, but that's another topic.

On 23/09/2019 11:35, Ingo Molnar wrote: > > * Jens Axboe <axboe@kernel.dk> wrote: > >> On 9/22/19 2:08 AM, Pavel Begunkov (Silence) wrote: >>> From: Pavel Begunkov <asml.silence@gmail.com> >>> >>> There could be a lot of overhead within generic wait_event_*() used for >>> waiting for large number of completions. The patchset removes much of >>> it by using custom wait event (wait_threshold). >>> >>> Synthetic test showed ~40% performance boost. (see patch 2) >> >> I'm fine with the io_uring side of things, but to queue this up we >> really need Peter or Ingo to sign off on the core wakeup bits... >> >> Peter? > > I'm not sure an extension is needed for such a special interface, why not > just put a ->threshold value next to the ctx->wait field and use either > the regular wait_event() APIs with the proper condition, or > wait_event_cmd() style APIs if you absolutely need something more complex > to happen inside? > > Should result in a much lower linecount and no scheduler changes. :-) > > Thanks, > > Ingo >
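Pavel's pseudo code maps onto the stock waitqueue machinery roughly as follows. This is a minimal sketch, not the patchset's actual wait_threshold code; the names threshold_waiter, threshold_wake_function and wake_up_threshold are illustrative only:

```c
#include <linux/wait.h>
#include <linux/sched.h>

/*
 * Sketch: each waiter records its own threshold, the waker passes the current
 * counter value as the wake-up key, and a waiter is only scheduled once its
 * threshold has actually been reached -- no O(threshold) spurious wakeups.
 *
 * The waiter sets wq.private = current and wq.func = threshold_wake_function,
 * then sleeps with the usual prepare_to_wait()/schedule()/finish_wait() loop.
 */
struct threshold_waiter {
	struct wait_queue_entry	wq;
	unsigned long		threshold;
};

static int threshold_wake_function(struct wait_queue_entry *curr,
				   unsigned int mode, int wake_flags, void *key)
{
	struct threshold_waiter *tw = container_of(curr, struct threshold_waiter, wq);
	unsigned long val = (unsigned long)key;	/* counter value from the waker */

	if (val < tw->threshold)
		return 0;	/* not enough events yet: stay asleep, keep scanning */

	/* Wake the task in wq.private and drop the entry from the list. */
	return autoremove_wake_function(curr, mode, wake_flags, key);
}

/* Waker side: let every waiter whose threshold is covered by @val decide. */
static void wake_up_threshold(struct wait_queue_head *wq_head, unsigned long val)
{
	__wake_up(wq_head, TASK_NORMAL, 0, (void *)val);
}
```

Note that this sketch returns 0 (keep scanning) for unsatisfied waiters so that later waiters with smaller thresholds still get checked; the io_uring patch discussed below deliberately returns -1 instead, to stop at the first unsatisfied exclusive waiter and keep FIFO ordering.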
Sorry, mixed the threads. >> >> I'm not sure an extension is needed for such a special interface, why not >> just put a ->threshold value next to the ctx->wait field and use either >> the regular wait_event() APIs with the proper condition, or >> wait_event_cmd() style APIs if you absolutely need something more complex >> to happen inside? Ingo, io_uring works well without this patch just using wait_event_*() with proper condition, but there are performance issues with spurious wakeups. Detailed description in the previous mail. Am I missing something? Thanks >> >> Should result in a much lower linecount and no scheduler changes. :-) >> >> Thanks, >> >> Ingo >> >
On 9/23/19 10:32 AM, Pavel Begunkov wrote: > Sorry, mixed the threads. > >>> >>> I'm not sure an extension is needed for such a special interface, why not >>> just put a ->threshold value next to the ctx->wait field and use either >>> the regular wait_event() APIs with the proper condition, or >>> wait_event_cmd() style APIs if you absolutely need something more complex >>> to happen inside? > Ingo, > io_uring works well without this patch just using wait_event_*() with > proper condition, but there are performance issues with spurious > wakeups. Detailed description in the previous mail. > Am I missing something? I think we can do the same thing, just wrapping the waitqueue in a structure with a count in it, on the stack. Got some flight time coming up later today, let me try and cook up a patch.
On 9/23/19 2:48 PM, Jens Axboe wrote: > On 9/23/19 10:32 AM, Pavel Begunkov wrote: >> Sorry, mixed the threads. >> >>>> >>>> I'm not sure an extension is needed for such a special interface, why not >>>> just put a ->threshold value next to the ctx->wait field and use either >>>> the regular wait_event() APIs with the proper condition, or >>>> wait_event_cmd() style APIs if you absolutely need something more complex >>>> to happen inside? >> Ingo, >> io_uring works well without this patch just using wait_event_*() with >> proper condition, but there are performance issues with spurious >> wakeups. Detailed description in the previous mail. >> Am I missing something? > > I think we can do the same thing, just wrapping the waitqueue in a > structure with a count in it, on the stack. Got some flight time > coming up later today, let me try and cook up a patch. Totally untested, and sent out 5 min before departure... But something like this. diff --git a/fs/io_uring.c b/fs/io_uring.c index ca7570aca430..c2f9e1da26dd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2768,6 +2768,37 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, return submit; } +struct io_wait_queue { + struct wait_queue_entry wq; + struct io_ring_ctx *ctx; + struct task_struct *task; + unsigned to_wait; + unsigned nr_timeouts; +}; + +static inline bool io_should_wake(struct io_wait_queue *iowq) +{ + struct io_ring_ctx *ctx = iowq->ctx; + + return io_cqring_events(ctx->rings) >= iowq->to_wait || + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; +} + +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, + int wake_flags, void *key) +{ + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, + wq); + + if (io_should_wake(iowq)) { + list_del_init(&curr->entry); + wake_up_process(iowq->task); + return 1; + } + + return -1; +} + /* * Wait until events become available, if we don't already have some. The * application must reap them itself, as they reside on the shared cq ring. @@ -2775,8 +2806,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, const sigset_t __user *sig, size_t sigsz) { + struct io_wait_queue iowq = { + .wq = { + .func = io_wake_function, + .entry = LIST_HEAD_INIT(iowq.wq.entry), + }, + .task = current, + .ctx = ctx, + .to_wait = min_events, + }; struct io_rings *rings = ctx->rings; - unsigned nr_timeouts; int ret; if (io_cqring_events(rings) >= min_events) @@ -2795,15 +2834,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, return ret; } - nr_timeouts = atomic_read(&ctx->cq_timeouts); - /* - * Return if we have enough events, or if a timeout occured since - * we started waiting. For timeouts, we always want to return to - * userspace. - */ - ret = wait_event_interruptible(ctx->wait, - io_cqring_events(rings) >= min_events || - atomic_read(&ctx->cq_timeouts) != nr_timeouts); + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); + do { + if (io_should_wake(&iowq)) + break; + schedule(); + set_current_state(TASK_INTERRUPTIBLE); + } while (1); + finish_wait(&ctx->wait, &iowq.wq); + restore_saved_sigmask_unless(ret == -ERESTARTSYS); if (ret == -ERESTARTSYS) ret = -EINTR;
On 24/09/2019 02:00, Jens Axboe wrote: >> I think we can do the same thing, just wrapping the waitqueue in a >> structure with a count in it, on the stack. Got some flight time >> coming up later today, let me try and cook up a patch. > > Totally untested, and sent out 5 min before departure... But something > like this. Hmm, reminds me my first version. Basically that's the same thing but with macroses inlined. I wanted to make it reusable and self-contained, though. If you don't think it could be useful in other places, sure, we could do something like that. Is that so? > > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index ca7570aca430..c2f9e1da26dd 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -2768,6 +2768,37 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, > return submit; > } > > +struct io_wait_queue { > + struct wait_queue_entry wq; > + struct io_ring_ctx *ctx; > + struct task_struct *task; > + unsigned to_wait; > + unsigned nr_timeouts; > +}; > + > +static inline bool io_should_wake(struct io_wait_queue *iowq) > +{ > + struct io_ring_ctx *ctx = iowq->ctx; > + > + return io_cqring_events(ctx->rings) >= iowq->to_wait || > + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; > +} > + > +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, > + int wake_flags, void *key) > +{ > + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, > + wq); > + > + if (io_should_wake(iowq)) { > + list_del_init(&curr->entry); > + wake_up_process(iowq->task); > + return 1; > + } > + > + return -1; > +} > + > /* > * Wait until events become available, if we don't already have some. The > * application must reap them itself, as they reside on the shared cq ring. > @@ -2775,8 +2806,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, > static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, > const sigset_t __user *sig, size_t sigsz) > { > + struct io_wait_queue iowq = { > + .wq = { > + .func = io_wake_function, > + .entry = LIST_HEAD_INIT(iowq.wq.entry), > + }, > + .task = current, > + .ctx = ctx, > + .to_wait = min_events, > + }; > struct io_rings *rings = ctx->rings; > - unsigned nr_timeouts; > int ret; > > if (io_cqring_events(rings) >= min_events) > @@ -2795,15 +2834,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, > return ret; > } > > - nr_timeouts = atomic_read(&ctx->cq_timeouts); > - /* > - * Return if we have enough events, or if a timeout occured since > - * we started waiting. For timeouts, we always want to return to > - * userspace. > - */ > - ret = wait_event_interruptible(ctx->wait, > - io_cqring_events(rings) >= min_events || > - atomic_read(&ctx->cq_timeouts) != nr_timeouts); > + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); > + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); > + do { > + if (io_should_wake(&iowq)) > + break; > + schedule(); > + set_current_state(TASK_INTERRUPTIBLE); > + } while (1); > + finish_wait(&ctx->wait, &iowq.wq); > + > restore_saved_sigmask_unless(ret == -ERESTARTSYS); > if (ret == -ERESTARTSYS) > ret = -EINTR; >
On 9/24/19 1:06 AM, Pavel Begunkov wrote: > On 24/09/2019 02:00, Jens Axboe wrote: >>> I think we can do the same thing, just wrapping the waitqueue in a >>> structure with a count in it, on the stack. Got some flight time >>> coming up later today, let me try and cook up a patch. >> >> Totally untested, and sent out 5 min before departure... But something >> like this. > Hmm, reminds me my first version. Basically that's the same thing but > with macroses inlined. I wanted to make it reusable and self-contained, > though. > > If you don't think it could be useful in other places, sure, we could do > something like that. Is that so? I totally agree it could be useful in other places. Maybe formalized and used with wake_up_nr() instead of adding a new primitive? Haven't looked into that, I may be talking nonsense. In any case, I did get a chance to test it and it works for me. Here's the "finished" version, slightly cleaned up and with a comment added for good measure. diff --git a/fs/io_uring.c b/fs/io_uring.c index ca7570aca430..14fae454cf75 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, return submit; } +struct io_wait_queue { + struct wait_queue_entry wq; + struct io_ring_ctx *ctx; + struct task_struct *task; + unsigned to_wait; + unsigned nr_timeouts; +}; + +static inline bool io_should_wake(struct io_wait_queue *iowq) +{ + struct io_ring_ctx *ctx = iowq->ctx; + + /* + * Wake up if we have enough events, or if a timeout occured since we + * started waiting. For timeouts, we always want to return to userspace, + * regardless of event count. + */ + return io_cqring_events(ctx->rings) >= iowq->to_wait || + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; +} + +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, + int wake_flags, void *key) +{ + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, + wq); + + if (io_should_wake(iowq)) { + list_del_init(&curr->entry); + wake_up_process(iowq->task); + return 1; + } + + return -1; +} + /* * Wait until events become available, if we don't already have some. The * application must reap them itself, as they reside on the shared cq ring. @@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, const sigset_t __user *sig, size_t sigsz) { + struct io_wait_queue iowq = { + .wq = { + .func = io_wake_function, + .entry = LIST_HEAD_INIT(iowq.wq.entry), + }, + .task = current, + .ctx = ctx, + .to_wait = min_events, + }; struct io_rings *rings = ctx->rings; - unsigned nr_timeouts; int ret; if (io_cqring_events(rings) >= min_events) @@ -2795,15 +2839,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, return ret; } - nr_timeouts = atomic_read(&ctx->cq_timeouts); - /* - * Return if we have enough events, or if a timeout occured since - * we started waiting. For timeouts, we always want to return to - * userspace. 
- */ - ret = wait_event_interruptible(ctx->wait, - io_cqring_events(rings) >= min_events || - atomic_read(&ctx->cq_timeouts) != nr_timeouts); + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); + do { + if (io_should_wake(&iowq)) + break; + schedule(); + set_current_state(TASK_INTERRUPTIBLE); + } while (1); + finish_wait(&ctx->wait, &iowq.wq); + restore_saved_sigmask_unless(ret == -ERESTARTSYS); if (ret == -ERESTARTSYS) ret = -EINTR;
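A side note on why the wake callback above can return -1: the waitqueue wake-up scan stops as soon as a wake function returns a negative value. Paraphrased and simplified (locking and bookmark handling omitted), the core loop in kernel/sched/wait.c looks roughly like this:

```c
/* Paraphrased from __wake_up_common() in kernel/sched/wait.c (simplified). */
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
			    int nr_exclusive, int wake_flags, void *key)
{
	struct wait_queue_entry *curr, *next;

	list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
		unsigned int flags = curr->flags;
		int ret = curr->func(curr, mode, wake_flags, key);

		if (ret < 0)		/* e.g. io_wake_function() returning -1 */
			break;		/* stop scanning: later waiters stay asleep */
		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;		/* woke enough exclusive waiters */
	}
	return nr_exclusive;
}
```

Because prepare_to_wait_exclusive() adds entries at the tail of the list, the scan visits waiters in FIFO order, which is the ordering property discussed in the follow-ups below.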
On 9/24/19 2:02 AM, Jens Axboe wrote: > On 9/24/19 1:06 AM, Pavel Begunkov wrote: >> On 24/09/2019 02:00, Jens Axboe wrote: >>>> I think we can do the same thing, just wrapping the waitqueue in a >>>> structure with a count in it, on the stack. Got some flight time >>>> coming up later today, let me try and cook up a patch. >>> >>> Totally untested, and sent out 5 min before departure... But something >>> like this. >> Hmm, reminds me my first version. Basically that's the same thing but >> with macroses inlined. I wanted to make it reusable and self-contained, >> though. >> >> If you don't think it could be useful in other places, sure, we could do >> something like that. Is that so? > > I totally agree it could be useful in other places. Maybe formalized and > used with wake_up_nr() instead of adding a new primitive? Haven't looked > into that, I may be talking nonsense. > > In any case, I did get a chance to test it and it works for me. Here's > the "finished" version, slightly cleaned up and with a comment added > for good measure. Notes: This version gets the ordering right, you need exclusive waits to get fifo ordering on the waitqueue. Both versions (yours and mine) suffer from the problem of potentially waking too many. I don't think this is a real issue, as generally we don't do threaded access to the io_urings. But if you had the following tasks wait on the cqring: [min_events = 32], [min_events = 8], [min_events = 8] and we reach the io_cqring_events() == threshold, we'll wake all three. I don't see a good solution to this, so I suspect we just live with until proven an issue. Both versions are much better than what we have now.
On 9/24/19 2:27 AM, Jens Axboe wrote: > On 9/24/19 2:02 AM, Jens Axboe wrote: >> On 9/24/19 1:06 AM, Pavel Begunkov wrote: >>> On 24/09/2019 02:00, Jens Axboe wrote: >>>>> I think we can do the same thing, just wrapping the waitqueue in a >>>>> structure with a count in it, on the stack. Got some flight time >>>>> coming up later today, let me try and cook up a patch. >>>> >>>> Totally untested, and sent out 5 min before departure... But something >>>> like this. >>> Hmm, reminds me my first version. Basically that's the same thing but >>> with macroses inlined. I wanted to make it reusable and self-contained, >>> though. >>> >>> If you don't think it could be useful in other places, sure, we could do >>> something like that. Is that so? >> >> I totally agree it could be useful in other places. Maybe formalized and >> used with wake_up_nr() instead of adding a new primitive? Haven't looked >> into that, I may be talking nonsense. >> >> In any case, I did get a chance to test it and it works for me. Here's >> the "finished" version, slightly cleaned up and with a comment added >> for good measure. > > Notes: > > This version gets the ordering right, you need exclusive waits to get > fifo ordering on the waitqueue. > > Both versions (yours and mine) suffer from the problem of potentially > waking too many. I don't think this is a real issue, as generally we > don't do threaded access to the io_urings. But if you had the following > tasks wait on the cqring: > > [min_events = 32], [min_events = 8], [min_events = 8] > > and we reach the io_cqring_events() == threshold, we'll wake all three. > I don't see a good solution to this, so I suspect we just live with > until proven an issue. Both versions are much better than what we have > now. Forgot an issue around signal handling, version below adds the right check for that too. Curious what your test case was for this? diff --git a/fs/io_uring.c b/fs/io_uring.c index ca7570aca430..3fbab5692f14 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, return submit; } +struct io_wait_queue { + struct wait_queue_entry wq; + struct io_ring_ctx *ctx; + struct task_struct *task; + unsigned to_wait; + unsigned nr_timeouts; +}; + +static inline bool io_should_wake(struct io_wait_queue *iowq) +{ + struct io_ring_ctx *ctx = iowq->ctx; + + /* + * Wake up if we have enough events, or if a timeout occured since we + * started waiting. For timeouts, we always want to return to userspace, + * regardless of event count. + */ + return io_cqring_events(ctx->rings) >= iowq->to_wait || + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; +} + +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, + int wake_flags, void *key) +{ + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, + wq); + + if (io_should_wake(iowq)) { + list_del_init(&curr->entry); + wake_up_process(iowq->task); + return 1; + } + + return -1; +} + /* * Wait until events become available, if we don't already have some. The * application must reap them itself, as they reside on the shared cq ring. 
@@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, const sigset_t __user *sig, size_t sigsz) { + struct io_wait_queue iowq = { + .wq = { + .func = io_wake_function, + .entry = LIST_HEAD_INIT(iowq.wq.entry), + }, + .task = current, + .ctx = ctx, + .to_wait = min_events, + }; struct io_rings *rings = ctx->rings; - unsigned nr_timeouts; int ret; if (io_cqring_events(rings) >= min_events) @@ -2795,15 +2839,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, return ret; } - nr_timeouts = atomic_read(&ctx->cq_timeouts); - /* - * Return if we have enough events, or if a timeout occured since - * we started waiting. For timeouts, we always want to return to - * userspace. - */ - ret = wait_event_interruptible(ctx->wait, - io_cqring_events(rings) >= min_events || - atomic_read(&ctx->cq_timeouts) != nr_timeouts); + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); + do { + if (io_should_wake(&iowq)) + break; + schedule(); + if (signal_pending(current)) + break; + set_current_state(TASK_INTERRUPTIBLE); + } while (1); + finish_wait(&ctx->wait, &iowq.wq); + restore_saved_sigmask_unless(ret == -ERESTARTSYS); if (ret == -ERESTARTSYS) ret = -EINTR;
On 24/09/2019 11:27, Jens Axboe wrote: > On 9/24/19 2:02 AM, Jens Axboe wrote: >> On 9/24/19 1:06 AM, Pavel Begunkov wrote: >>> On 24/09/2019 02:00, Jens Axboe wrote: >>>>> I think we can do the same thing, just wrapping the waitqueue in a >>>>> structure with a count in it, on the stack. Got some flight time >>>>> coming up later today, let me try and cook up a patch. >>>> >>>> Totally untested, and sent out 5 min before departure... But something >>>> like this. >>> Hmm, reminds me my first version. Basically that's the same thing but >>> with macroses inlined. I wanted to make it reusable and self-contained, >>> though. >>> >>> If you don't think it could be useful in other places, sure, we could do >>> something like that. Is that so? >> >> I totally agree it could be useful in other places. Maybe formalized and >> used with wake_up_nr() instead of adding a new primitive? Haven't looked >> into that, I may be talking nonsense. >> >> In any case, I did get a chance to test it and it works for me. Here's >> the "finished" version, slightly cleaned up and with a comment added >> for good measure. > > Notes: > > This version gets the ordering right, you need exclusive waits to get > fifo ordering on the waitqueue. > > Both versions (yours and mine) suffer from the problem of potentially > waking too many. I don't think this is a real issue, as generally we > don't do threaded access to the io_urings. But if you had the following > tasks wait on the cqring: > > [min_events = 32], [min_events = 8], [min_events = 8] > > and we reach the io_cqring_events() == threshold, we'll wake all three. > I don't see a good solution to this, so I suspect we just live with > until proven an issue. Both versions are much better than what we have > now. > If io_cqring_events() == 8, only the last two would be woken up in both implementations, as to_wait/threshold is specified per waiter. Isn't it? Agree with waiting, I don't see a good real-life case for that, that couldn't be done efficiently in userspace.
On 24/09/2019 11:02, Jens Axboe wrote: > On 9/24/19 1:06 AM, Pavel Begunkov wrote: >> On 24/09/2019 02:00, Jens Axboe wrote: >>>> I think we can do the same thing, just wrapping the waitqueue in a >>>> structure with a count in it, on the stack. Got some flight time >>>> coming up later today, let me try and cook up a patch. >>> >>> Totally untested, and sent out 5 min before departure... But something >>> like this. >> Hmm, reminds me my first version. Basically that's the same thing but >> with macroses inlined. I wanted to make it reusable and self-contained, >> though. >> >> If you don't think it could be useful in other places, sure, we could do >> something like that. Is that so? > > I totally agree it could be useful in other places. Maybe formalized and > used with wake_up_nr() instead of adding a new primitive? Haven't looked > into that, I may be talking nonsense. @nr there is about number of tasks to wake up. AFAIK doesn't solve the problem. > > In any case, I did get a chance to test it and it works for me. Here's > the "finished" version, slightly cleaned up and with a comment added > for good measure. > > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index ca7570aca430..14fae454cf75 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, > return submit; > } > > +struct io_wait_queue { > + struct wait_queue_entry wq; > + struct io_ring_ctx *ctx; > + struct task_struct *task; > + unsigned to_wait; > + unsigned nr_timeouts; > +}; > + > +static inline bool io_should_wake(struct io_wait_queue *iowq) > +{ > + struct io_ring_ctx *ctx = iowq->ctx; > + > + /* > + * Wake up if we have enough events, or if a timeout occured since we > + * started waiting. For timeouts, we always want to return to userspace, > + * regardless of event count. > + */ > + return io_cqring_events(ctx->rings) >= iowq->to_wait || > + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; > +} > + > +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, > + int wake_flags, void *key) > +{ > + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, > + wq); > + > + if (io_should_wake(iowq)) { > + list_del_init(&curr->entry); > + wake_up_process(iowq->task); > + return 1; > + } > + > + return -1; > +} > + > /* > * Wait until events become available, if we don't already have some. The > * application must reap them itself, as they reside on the shared cq ring. > @@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, > static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, > const sigset_t __user *sig, size_t sigsz) > { > + struct io_wait_queue iowq = { > + .wq = { > + .func = io_wake_function, > + .entry = LIST_HEAD_INIT(iowq.wq.entry), > + }, > + .task = current, > + .ctx = ctx, > + .to_wait = min_events, > + }; > struct io_rings *rings = ctx->rings; > - unsigned nr_timeouts; > int ret; > > if (io_cqring_events(rings) >= min_events) > @@ -2795,15 +2839,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, > return ret; > } > > - nr_timeouts = atomic_read(&ctx->cq_timeouts); > - /* > - * Return if we have enough events, or if a timeout occured since > - * we started waiting. For timeouts, we always want to return to > - * userspace. 
> - */ > - ret = wait_event_interruptible(ctx->wait, > - io_cqring_events(rings) >= min_events || > - atomic_read(&ctx->cq_timeouts) != nr_timeouts); > + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); > + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); > + do { > + if (io_should_wake(&iowq)) > + break; > + schedule(); > + set_current_state(TASK_INTERRUPTIBLE); > + } while (1); > + finish_wait(&ctx->wait, &iowq.wq); > + > restore_saved_sigmask_unless(ret == -ERESTARTSYS); > if (ret == -ERESTARTSYS) > ret = -EINTR; >
On 24/09/2019 11:36, Jens Axboe wrote: > On 9/24/19 2:27 AM, Jens Axboe wrote: >> On 9/24/19 2:02 AM, Jens Axboe wrote: >>> On 9/24/19 1:06 AM, Pavel Begunkov wrote: >>>> On 24/09/2019 02:00, Jens Axboe wrote: >>>>>> I think we can do the same thing, just wrapping the waitqueue in a >>>>>> structure with a count in it, on the stack. Got some flight time >>>>>> coming up later today, let me try and cook up a patch. >>>>> >>>>> Totally untested, and sent out 5 min before departure... But something >>>>> like this. >>>> Hmm, reminds me my first version. Basically that's the same thing but >>>> with macroses inlined. I wanted to make it reusable and self-contained, >>>> though. >>>> >>>> If you don't think it could be useful in other places, sure, we could do >>>> something like that. Is that so? >>> >>> I totally agree it could be useful in other places. Maybe formalized and >>> used with wake_up_nr() instead of adding a new primitive? Haven't looked >>> into that, I may be talking nonsense. >>> >>> In any case, I did get a chance to test it and it works for me. Here's >>> the "finished" version, slightly cleaned up and with a comment added >>> for good measure. >> >> Notes: >> >> This version gets the ordering right, you need exclusive waits to get >> fifo ordering on the waitqueue. >> >> Both versions (yours and mine) suffer from the problem of potentially >> waking too many. I don't think this is a real issue, as generally we >> don't do threaded access to the io_urings. But if you had the following >> tasks wait on the cqring: >> >> [min_events = 32], [min_events = 8], [min_events = 8] >> >> and we reach the io_cqring_events() == threshold, we'll wake all three. >> I don't see a good solution to this, so I suspect we just live with >> until proven an issue. Both versions are much better than what we have >> now. > > Forgot an issue around signal handling, version below adds the > right check for that too. It seems to be a good reason to not keep reimplementing "prepare_to_wait*() + wait loop" every time, but keep it in sched :) > > Curious what your test case was for this? You mean a performance test case? It's briefly described in a comment for the second patch. That's just rewritten io_uring-bench, with 1. a thread generating 1 request per call in a loop 2. and the second thread waiting for ~128 events. Both are pinned to the same core. > > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index ca7570aca430..3fbab5692f14 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, > return submit; > } > > +struct io_wait_queue { > + struct wait_queue_entry wq; > + struct io_ring_ctx *ctx; > + struct task_struct *task; > + unsigned to_wait; > + unsigned nr_timeouts; > +}; > + > +static inline bool io_should_wake(struct io_wait_queue *iowq) > +{ > + struct io_ring_ctx *ctx = iowq->ctx; > + > + /* > + * Wake up if we have enough events, or if a timeout occured since we > + * started waiting. For timeouts, we always want to return to userspace, > + * regardless of event count. 
> + */ > + return io_cqring_events(ctx->rings) >= iowq->to_wait || > + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; > +} > + > +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, > + int wake_flags, void *key) > +{ > + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, > + wq); > + > + if (io_should_wake(iowq)) { > + list_del_init(&curr->entry); > + wake_up_process(iowq->task); > + return 1; > + } > + > + return -1; > +} > + > /* > * Wait until events become available, if we don't already have some. The > * application must reap them itself, as they reside on the shared cq ring. > @@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, > static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, > const sigset_t __user *sig, size_t sigsz) > { > + struct io_wait_queue iowq = { > + .wq = { > + .func = io_wake_function, > + .entry = LIST_HEAD_INIT(iowq.wq.entry), > + }, > + .task = current, > + .ctx = ctx, > + .to_wait = min_events, > + }; > struct io_rings *rings = ctx->rings; > - unsigned nr_timeouts; > int ret; > > if (io_cqring_events(rings) >= min_events) > @@ -2795,15 +2839,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, > return ret; > } > > - nr_timeouts = atomic_read(&ctx->cq_timeouts); > - /* > - * Return if we have enough events, or if a timeout occured since > - * we started waiting. For timeouts, we always want to return to > - * userspace. > - */ > - ret = wait_event_interruptible(ctx->wait, > - io_cqring_events(rings) >= min_events || > - atomic_read(&ctx->cq_timeouts) != nr_timeouts); > + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); > + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); > + do { > + if (io_should_wake(&iowq)) > + break; > + schedule(); > + if (signal_pending(current)) > + break; > + set_current_state(TASK_INTERRUPTIBLE); > + } while (1); > + finish_wait(&ctx->wait, &iowq.wq); > + > restore_saved_sigmask_unless(ret == -ERESTARTSYS); > if (ret == -ERESTARTSYS) > ret = -EINTR; >
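The benchmark described above (one thread submitting a single request per call in a loop, another waiting for ~128 completions, both pinned to one core) has roughly the following shape. The sketch below is a hypothetical liburing rewrite for readability, not the actual modified io_uring-bench, which drives the raw syscalls directly:

```c
#include <liburing.h>
#include <pthread.h>

#define WAIT_BATCH 128

static struct io_uring ring;

/* Submission side: one request per submit call, in a tight loop. */
static void *submitter(void *arg)
{
	for (;;) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (!sqe)
			continue;		/* SQ full, let the waiter catch up */
		io_uring_prep_nop(sqe);
		io_uring_submit(&ring);
	}
	return NULL;
}

/* Completion side: block until ~128 CQEs are available, then reap them all. */
static void *waiter(void *arg)
{
	for (;;) {
		struct io_uring_cqe *cqe;

		io_uring_wait_cqe_nr(&ring, &cqe, WAIT_BATCH);
		io_uring_cq_advance(&ring, io_uring_cq_ready(&ring));
	}
	return NULL;
}

int main(void)
{
	pthread_t s, w;

	io_uring_queue_init(256, &ring, 0);
	pthread_create(&w, NULL, waiter, NULL);
	pthread_create(&s, NULL, submitter, NULL);
	pthread_join(w, NULL);		/* runs until interrupted, as in a benchmark */
	return 0;
}
```

Pinning both threads to the same CPU (taskset or pthread_setaffinity_np()) is what makes the context switches from spurious wakeups dominate the profile.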
On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote: > +struct io_wait_queue { > + struct wait_queue_entry wq; > + struct io_ring_ctx *ctx; > + struct task_struct *task; wq.private is where the normal waitqueue stores the task pointer. (I'm going to rename that) > + unsigned to_wait; > + unsigned nr_timeouts; > +}; > + > +static inline bool io_should_wake(struct io_wait_queue *iowq) > +{ > + struct io_ring_ctx *ctx = iowq->ctx; > + > + /* > + * Wake up if we have enough events, or if a timeout occured since we > + * started waiting. For timeouts, we always want to return to userspace, > + * regardless of event count. > + */ > + return io_cqring_events(ctx->rings) >= iowq->to_wait || > + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; > +} > + > +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, > + int wake_flags, void *key) > +{ > + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, > + wq); > + > + if (io_should_wake(iowq)) { > + list_del_init(&curr->entry); > + wake_up_process(iowq->task); Then you can use autoremove_wake_function() here. > + return 1; > + } > + > + return -1; > +} Ideally we'd get wait_event()'s @cond in a custom wake function. Then we can _always_ do this. This is one I'd love to have lambda functions for. It would actually work with GCC nested functions, because the wake function will always be in scope, but we can't use those in the kernel for other reasons :/
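For reference, the helper Peter points at is tiny; in kernels of this era it is essentially the following (abridged from kernel/sched/wait.c):

```c
int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode,
			     int sync, void *key)
{
	/* Wakes the task stored in wq_entry->private ... */
	int ret = default_wake_function(wq_entry, mode, sync, key);

	/* ... and, if the wakeup succeeded, removes the entry from the list. */
	if (ret)
		list_del_init(&wq_entry->entry);

	return ret;
}
```

So once the task pointer lives in wq.private, the open-coded list_del_init() + wake_up_process() pair above collapses into a single call.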
On 9/24/19 3:20 AM, Pavel Begunkov wrote: > > > On 24/09/2019 11:27, Jens Axboe wrote: >> On 9/24/19 2:02 AM, Jens Axboe wrote: >>> On 9/24/19 1:06 AM, Pavel Begunkov wrote: >>>> On 24/09/2019 02:00, Jens Axboe wrote: >>>>>> I think we can do the same thing, just wrapping the waitqueue in a >>>>>> structure with a count in it, on the stack. Got some flight time >>>>>> coming up later today, let me try and cook up a patch. >>>>> >>>>> Totally untested, and sent out 5 min before departure... But something >>>>> like this. >>>> Hmm, reminds me my first version. Basically that's the same thing but >>>> with macroses inlined. I wanted to make it reusable and self-contained, >>>> though. >>>> >>>> If you don't think it could be useful in other places, sure, we could do >>>> something like that. Is that so? >>> >>> I totally agree it could be useful in other places. Maybe formalized and >>> used with wake_up_nr() instead of adding a new primitive? Haven't looked >>> into that, I may be talking nonsense. >>> >>> In any case, I did get a chance to test it and it works for me. Here's >>> the "finished" version, slightly cleaned up and with a comment added >>> for good measure. >> >> Notes: >> >> This version gets the ordering right, you need exclusive waits to get >> fifo ordering on the waitqueue. >> >> Both versions (yours and mine) suffer from the problem of potentially >> waking too many. I don't think this is a real issue, as generally we >> don't do threaded access to the io_urings. But if you had the following >> tasks wait on the cqring: >> >> [min_events = 32], [min_events = 8], [min_events = 8] >> >> and we reach the io_cqring_events() == threshold, we'll wake all three. >> I don't see a good solution to this, so I suspect we just live with >> until proven an issue. Both versions are much better than what we have >> now. >> > If io_cqring_events() == 8, only the last two would be woken up in both > implementations, as to_wait/threshold is specified per waiter. Isn't it? If io_cqring_events() == 8, then none would be woken in my implementation since the first one will break the wakeup loop. > Agree with waiting, I don't see a good real-life case for that, that > couldn't be done efficiently in userspace. Exactly
On 9/24/19 3:21 AM, Pavel Begunkov wrote: > On 24/09/2019 11:02, Jens Axboe wrote: >> On 9/24/19 1:06 AM, Pavel Begunkov wrote: >>> On 24/09/2019 02:00, Jens Axboe wrote: >>>>> I think we can do the same thing, just wrapping the waitqueue in a >>>>> structure with a count in it, on the stack. Got some flight time >>>>> coming up later today, let me try and cook up a patch. >>>> >>>> Totally untested, and sent out 5 min before departure... But something >>>> like this. >>> Hmm, reminds me my first version. Basically that's the same thing but >>> with macroses inlined. I wanted to make it reusable and self-contained, >>> though. >>> >>> If you don't think it could be useful in other places, sure, we could do >>> something like that. Is that so? >> >> I totally agree it could be useful in other places. Maybe formalized and >> used with wake_up_nr() instead of adding a new primitive? Haven't looked >> into that, I may be talking nonsense. > > @nr there is about number of tasks to wake up. AFAIK doesn't solve the > problem. Ah right, embarassingly I'm actually the one that added that functionality ages ago...
On 9/24/19 3:33 AM, Pavel Begunkov wrote: > > > On 24/09/2019 11:36, Jens Axboe wrote: >> On 9/24/19 2:27 AM, Jens Axboe wrote: >>> On 9/24/19 2:02 AM, Jens Axboe wrote: >>>> On 9/24/19 1:06 AM, Pavel Begunkov wrote: >>>>> On 24/09/2019 02:00, Jens Axboe wrote: >>>>>>> I think we can do the same thing, just wrapping the waitqueue in a >>>>>>> structure with a count in it, on the stack. Got some flight time >>>>>>> coming up later today, let me try and cook up a patch. >>>>>> >>>>>> Totally untested, and sent out 5 min before departure... But something >>>>>> like this. >>>>> Hmm, reminds me my first version. Basically that's the same thing but >>>>> with macroses inlined. I wanted to make it reusable and self-contained, >>>>> though. >>>>> >>>>> If you don't think it could be useful in other places, sure, we could do >>>>> something like that. Is that so? >>>> >>>> I totally agree it could be useful in other places. Maybe formalized and >>>> used with wake_up_nr() instead of adding a new primitive? Haven't looked >>>> into that, I may be talking nonsense. >>>> >>>> In any case, I did get a chance to test it and it works for me. Here's >>>> the "finished" version, slightly cleaned up and with a comment added >>>> for good measure. >>> >>> Notes: >>> >>> This version gets the ordering right, you need exclusive waits to get >>> fifo ordering on the waitqueue. >>> >>> Both versions (yours and mine) suffer from the problem of potentially >>> waking too many. I don't think this is a real issue, as generally we >>> don't do threaded access to the io_urings. But if you had the following >>> tasks wait on the cqring: >>> >>> [min_events = 32], [min_events = 8], [min_events = 8] >>> >>> and we reach the io_cqring_events() == threshold, we'll wake all three. >>> I don't see a good solution to this, so I suspect we just live with >>> until proven an issue. Both versions are much better than what we have >>> now. >> >> Forgot an issue around signal handling, version below adds the >> right check for that too. > > It seems to be a good reason to not keep reimplementing > "prepare_to_wait*() + wait loop" every time, but keep it in sched :) I think if we do the ->private cleanup that Peter mentioned, then there's not much left in terms of consolidation. Not convinced the case is interesting enough to warrant a special helper. If others show up, it's easy enough to consolidate the use cases and unify them. If you look at wake_up_nr(), I would have thought that would be more widespread. But it really isn't. >> Curious what your test case was for this? > You mean a performance test case? It's briefly described in a comment > for the second patch. That's just rewritten io_uring-bench, with > 1. a thread generating 1 request per call in a loop > 2. and the second thread waiting for ~128 events. > Both are pinned to the same core. Gotcha, thanks.
On 9/24/19 3:49 AM, Peter Zijlstra wrote: > On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote: > >> +struct io_wait_queue { >> + struct wait_queue_entry wq; >> + struct io_ring_ctx *ctx; >> + struct task_struct *task; > > wq.private is where the normal waitqueue stores the task pointer. > > (I'm going to rename that) If you do that, then we can just base the io_uring parts on that. >> + unsigned to_wait; >> + unsigned nr_timeouts; >> +}; >> + >> +static inline bool io_should_wake(struct io_wait_queue *iowq) >> +{ >> + struct io_ring_ctx *ctx = iowq->ctx; >> + >> + /* >> + * Wake up if we have enough events, or if a timeout occured since we >> + * started waiting. For timeouts, we always want to return to userspace, >> + * regardless of event count. >> + */ >> + return io_cqring_events(ctx->rings) >= iowq->to_wait || >> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; >> +} >> + >> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, >> + int wake_flags, void *key) >> +{ >> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, >> + wq); >> + >> + if (io_should_wake(iowq)) { >> + list_del_init(&curr->entry); >> + wake_up_process(iowq->task); > > Then you can use autoremove_wake_function() here. > >> + return 1; >> + } >> + >> + return -1; >> +} > > Ideally we'd get wait_event()'s @cond in a custom wake function. Then we > can _always_ do this. > > This is one I'd love to have lambda functions for. It would actually > work with GCC nested functions, because the wake function will always be > in scope, but we can't use those in the kernel for other reasons :/ I'll be happy enough if I can just call autoremove_wake_function(), I think that will simplify the case enough for io_uring to not really make me care too much about going further. I'll leave that to you, if you have the desire :-)
On 9/24/19 4:13 AM, Jens Axboe wrote: > On 9/24/19 3:49 AM, Peter Zijlstra wrote: >> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote: >> >>> +struct io_wait_queue { >>> + struct wait_queue_entry wq; >>> + struct io_ring_ctx *ctx; >>> + struct task_struct *task; >> >> wq.private is where the normal waitqueue stores the task pointer. >> >> (I'm going to rename that) > > If you do that, then we can just base the io_uring parts on that. Just took a quick look at it, and ran into block/kyber-iosched.c that actually uses the private pointer for something that isn't a task struct...
On 24/09/2019 13:34, Jens Axboe wrote: > On 9/24/19 4:13 AM, Jens Axboe wrote: >> On 9/24/19 3:49 AM, Peter Zijlstra wrote: >>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote: >>> >>>> +struct io_wait_queue { >>>> + struct wait_queue_entry wq; >>>> + struct io_ring_ctx *ctx; >>>> + struct task_struct *task; >>> >>> wq.private is where the normal waitqueue stores the task pointer. >>> >>> (I'm going to rename that) >> >> If you do that, then we can just base the io_uring parts on that. > > Just took a quick look at it, and ran into block/kyber-iosched.c that > actually uses the private pointer for something that isn't a task > struct... > Let reuse autoremove_wake_function anyway. Changed a bit your patch: diff --git a/fs/io_uring.c b/fs/io_uring.c index 5c3f2bb81637..a77971290fdd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2690,6 +2690,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, return submit; } +struct io_wait_queue { + struct wait_queue_entry wq; + struct io_ring_ctx *ctx; + unsigned to_wait; + unsigned nr_timeouts; +}; + +static inline bool io_should_wake(struct io_wait_queue *iowq) +{ + struct io_ring_ctx *ctx = iowq->ctx; + + /* + * Wake up if we have enough events, or if a timeout occured since we + * started waiting. For timeouts, we always want to return to userspace, + * regardless of event count. + */ + return io_cqring_events(ctx->rings) >= iowq->to_wait || + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; +} + +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, + int wake_flags, void *key) +{ + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, + wq); + + if (!io_should_wake(iowq)) + return -1; + + return autoremove_wake_function(curr, mode, wake_flags, key); +} + /* * Wait until events become available, if we don't already have some. The * application must reap them itself, as they reside on the shared cq ring. @@ -2697,8 +2729,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, const sigset_t __user *sig, size_t sigsz) { + struct io_wait_queue iowq = { + .wq = { + .private = current, + .func = io_wake_function, + .entry = LIST_HEAD_INIT(iowq.wq.entry), + }, + .ctx = ctx, + .to_wait = min_events, + }; struct io_rings *rings = ctx->rings; - unsigned nr_timeouts; int ret; if (io_cqring_events(rings) >= min_events) @@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, return ret; } - nr_timeouts = atomic_read(&ctx->cq_timeouts); - /* - * Return if we have enough events, or if a timeout occured since - * we started waiting. For timeouts, we always want to return to - * userspace. - */ - ret = wait_event_interruptible(ctx->wait, - io_cqring_events(rings) >= min_events || - atomic_read(&ctx->cq_timeouts) != nr_timeouts); + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); + do { + if (io_should_wake(&iowq)) + break; + schedule(); + if (signal_pending(current)) + break; + set_current_state(TASK_INTERRUPTIBLE); + } while (1); + finish_wait(&ctx->wait, &iowq.wq); + restore_saved_sigmask_unless(ret == -ERESTARTSYS); if (ret == -ERESTARTSYS) ret = -EINTR;
On 9/24/19 5:11 AM, Pavel Begunkov wrote: > On 24/09/2019 13:34, Jens Axboe wrote: >> On 9/24/19 4:13 AM, Jens Axboe wrote: >>> On 9/24/19 3:49 AM, Peter Zijlstra wrote: >>>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote: >>>> >>>>> +struct io_wait_queue { >>>>> + struct wait_queue_entry wq; >>>>> + struct io_ring_ctx *ctx; >>>>> + struct task_struct *task; >>>> >>>> wq.private is where the normal waitqueue stores the task pointer. >>>> >>>> (I'm going to rename that) >>> >>> If you do that, then we can just base the io_uring parts on that. >> >> Just took a quick look at it, and ran into block/kyber-iosched.c that >> actually uses the private pointer for something that isn't a task >> struct... >> > > Let reuse autoremove_wake_function anyway. Changed a bit your patch: Yep that should do it, and saves 8 bytes of stack as well. BTW, did you test my patch, this one or the previous? Just curious if it worked for you.
On 24/09/2019 14:15, Jens Axboe wrote: > On 9/24/19 5:11 AM, Pavel Begunkov wrote: >> On 24/09/2019 13:34, Jens Axboe wrote: >>> On 9/24/19 4:13 AM, Jens Axboe wrote: >>>> On 9/24/19 3:49 AM, Peter Zijlstra wrote: >>>>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote: >>>>> >>>>>> +struct io_wait_queue { >>>>>> + struct wait_queue_entry wq; >>>>>> + struct io_ring_ctx *ctx; >>>>>> + struct task_struct *task; >>>>> >>>>> wq.private is where the normal waitqueue stores the task pointer. >>>>> >>>>> (I'm going to rename that) >>>> >>>> If you do that, then we can just base the io_uring parts on that. >>> >>> Just took a quick look at it, and ran into block/kyber-iosched.c that >>> actually uses the private pointer for something that isn't a task >>> struct... >>> >> >> Let reuse autoremove_wake_function anyway. Changed a bit your patch: > > Yep that should do it, and saves 8 bytes of stack as well. > > BTW, did you test my patch, this one or the previous? Just curious if it > worked for you. > Not yet, going to do that tonight
On Tue, Sep 24, 2019 at 12:34:17PM +0200, Jens Axboe wrote: > Just took a quick look at it, and ran into block/kyber-iosched.c that > actually uses the private pointer for something that isn't a task > struct... Argh... that's some 'creative' abuse of the waitqueue API :/
On Tue, Sep 24, 2019 at 02:11:29PM +0300, Pavel Begunkov wrote: > @@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, > return ret; > } > > + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); > + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); > + do { > + if (io_should_wake(&iowq)) > + break; > + schedule(); > + if (signal_pending(current)) > + break; > + set_current_state(TASK_INTERRUPTIBLE); > + } while (1); > + finish_wait(&ctx->wait, &iowq.wq); It it likely OK, but for paranoia, I'd prefer this form: for (;;) { prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); if (io_should_wake(&iowq)) break; schedule(); if (signal_pending(current)) break; } finish_wait(&ctx->wait, &iowq.wq); The thing is, if we ever succeed with io_wake_function() (that CPU observes io_should_wake()), but when waking here, we do not observe is_wake_function() and go sleep again, we might never wake up if we don't put ourselves back on the wait-list again.
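The lost-wakeup window Peter describes, drawn out as a timeline (illustrative only):

```c
/*
 *   waiter (io_cqring_wait)              waker (io_wake_function)
 *   ------------------------             -------------------------
 *   prepare_to_wait_exclusive()
 *   io_should_wake() -> false
 *   schedule()  [sleeping]
 *                                        io_should_wake() -> true
 *                                        list_del_init(&curr->entry)
 *                                        wake_up_process(task)
 *   [woken, runs again]
 *   io_should_wake() -> false again      (e.g. the CQ events were already
 *                                         reaped by someone else)
 *   schedule()  [sleeps, but is no longer on ctx->wait,
 *                so it may never be woken again]
 *
 * Calling prepare_to_wait_exclusive() inside the loop re-adds the entry to
 * the waitqueue on every iteration, closing the window.
 */
```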
On 9/24/19 5:43 AM, Peter Zijlstra wrote: > On Tue, Sep 24, 2019 at 02:11:29PM +0300, Pavel Begunkov wrote: > >> @@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, >> return ret; >> } >> >> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); >> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); >> + do { >> + if (io_should_wake(&iowq)) >> + break; >> + schedule(); >> + if (signal_pending(current)) >> + break; >> + set_current_state(TASK_INTERRUPTIBLE); >> + } while (1); >> + finish_wait(&ctx->wait, &iowq.wq); > > It it likely OK, but for paranoia, I'd prefer this form: > > for (;;) { > prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); > if (io_should_wake(&iowq)) > break; > schedule(); > if (signal_pending(current)) > break; > } > finish_wait(&ctx->wait, &iowq.wq); > > The thing is, if we ever succeed with io_wake_function() (that CPU > observes io_should_wake()), but when waking here, we do not observe > is_wake_function() and go sleep again, we might never wake up if we > don't put ourselves back on the wait-list again. Might be paranoia indeed, but especially after this change, we don't expect to make frequent roundtrips there. So better safe than sorry, I'll make the change.
On 9/24/19 5:23 AM, Pavel Begunkov wrote: >> Yep that should do it, and saves 8 bytes of stack as well. >> >> BTW, did you test my patch, this one or the previous? Just curious if it >> worked for you. >> > Not yet, going to do that tonight Thanks! For reference, the final version is below. There was still a signal mishap in there, now it should all be correct afaict. diff --git a/fs/io_uring.c b/fs/io_uring.c index 9b84232e5cc4..d2a86164d520 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, return submit; } +struct io_wait_queue { + struct wait_queue_entry wq; + struct io_ring_ctx *ctx; + unsigned to_wait; + unsigned nr_timeouts; +}; + +static inline bool io_should_wake(struct io_wait_queue *iowq) +{ + struct io_ring_ctx *ctx = iowq->ctx; + + /* + * Wake up if we have enough events, or if a timeout occured since we + * started waiting. For timeouts, we always want to return to userspace, + * regardless of event count. + */ + return io_cqring_events(ctx->rings) >= iowq->to_wait || + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; +} + +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, + int wake_flags, void *key) +{ + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, + wq); + + if (!io_should_wake(iowq)) + return -1; + + return autoremove_wake_function(curr, mode, wake_flags, key); +} + /* * Wait until events become available, if we don't already have some. The * application must reap them itself, as they reside on the shared cq ring. @@ -2775,8 +2807,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, const sigset_t __user *sig, size_t sigsz) { + struct io_wait_queue iowq = { + .wq = { + .private = current, + .func = io_wake_function, + .entry = LIST_HEAD_INIT(iowq.wq.entry), + }, + .ctx = ctx, + .to_wait = min_events, + }; struct io_rings *rings = ctx->rings; - unsigned nr_timeouts; int ret; if (io_cqring_events(rings) >= min_events) @@ -2795,15 +2835,20 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, return ret; } - nr_timeouts = atomic_read(&ctx->cq_timeouts); - /* - * Return if we have enough events, or if a timeout occured since - * we started waiting. For timeouts, we always want to return to - * userspace. - */ - ret = wait_event_interruptible(ctx->wait, - io_cqring_events(rings) >= min_events || - atomic_read(&ctx->cq_timeouts) != nr_timeouts); + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); + do { + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, + TASK_INTERRUPTIBLE); + if (io_should_wake(&iowq)) + break; + schedule(); + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + } while (1); + finish_wait(&ctx->wait, &iowq.wq); + restore_saved_sigmask_unless(ret == -ERESTARTSYS); if (ret == -ERESTARTSYS) ret = -EINTR;
On 24/09/2019 16:13, Jens Axboe wrote: > On 9/24/19 5:23 AM, Pavel Begunkov wrote: >>> Yep that should do it, and saves 8 bytes of stack as well. >>> >>> BTW, did you test my patch, this one or the previous? Just curious if it >>> worked for you. >>> >> Not yet, going to do that tonight > > Thanks! For reference, the final version is below. There was still a > signal mishap in there, now it should all be correct afaict. > > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 9b84232e5cc4..d2a86164d520 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, > return submit; > } > > +struct io_wait_queue { > + struct wait_queue_entry wq; > + struct io_ring_ctx *ctx; > + unsigned to_wait; > + unsigned nr_timeouts; > +}; > + > +static inline bool io_should_wake(struct io_wait_queue *iowq) > +{ > + struct io_ring_ctx *ctx = iowq->ctx; > + > + /* > + * Wake up if we have enough events, or if a timeout occured since we > + * started waiting. For timeouts, we always want to return to userspace, > + * regardless of event count. > + */ > + return io_cqring_events(ctx->rings) >= iowq->to_wait || > + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; > +} > + > +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, > + int wake_flags, void *key) > +{ > + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, > + wq); > + > + if (!io_should_wake(iowq)) > + return -1; It would try to schedule only the first task in the wait list. Is that the semantic you want? E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns after @32, and won't wake up the second one. > + > + return autoremove_wake_function(curr, mode, wake_flags, key); > +} > + > /* > * Wait until events become available, if we don't already have some. The > * application must reap them itself, as they reside on the shared cq ring. > @@ -2775,8 +2807,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, > static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, > const sigset_t __user *sig, size_t sigsz) > { > + struct io_wait_queue iowq = { > + .wq = { > + .private = current, > + .func = io_wake_function, > + .entry = LIST_HEAD_INIT(iowq.wq.entry), > + }, > + .ctx = ctx, > + .to_wait = min_events, > + }; > struct io_rings *rings = ctx->rings; > - unsigned nr_timeouts; > int ret; > > if (io_cqring_events(rings) >= min_events) > @@ -2795,15 +2835,20 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, > return ret; > } > > - nr_timeouts = atomic_read(&ctx->cq_timeouts); > - /* > - * Return if we have enough events, or if a timeout occured since > - * we started waiting. For timeouts, we always want to return to > - * userspace. > - */ > - ret = wait_event_interruptible(ctx->wait, > - io_cqring_events(rings) >= min_events || > - atomic_read(&ctx->cq_timeouts) != nr_timeouts); > + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); > + do { > + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, > + TASK_INTERRUPTIBLE); > + if (io_should_wake(&iowq)) > + break; > + schedule(); > + if (signal_pending(current)) { > + ret = -ERESTARTSYS; > + break; > + } > + } while (1); > + finish_wait(&ctx->wait, &iowq.wq); > + > restore_saved_sigmask_unless(ret == -ERESTARTSYS); > if (ret == -ERESTARTSYS) > ret = -EINTR; >
On 9/24/19 11:33 AM, Pavel Begunkov wrote: > On 24/09/2019 16:13, Jens Axboe wrote: >> On 9/24/19 5:23 AM, Pavel Begunkov wrote: >>>> Yep that should do it, and saves 8 bytes of stack as well. >>>> >>>> BTW, did you test my patch, this one or the previous? Just curious if it >>>> worked for you. >>>> >>> Not yet, going to do that tonight >> >> Thanks! For reference, the final version is below. There was still a >> signal mishap in there, now it should all be correct afaict. >> >> >> diff --git a/fs/io_uring.c b/fs/io_uring.c >> index 9b84232e5cc4..d2a86164d520 100644 >> --- a/fs/io_uring.c >> +++ b/fs/io_uring.c >> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, >> return submit; >> } >> >> +struct io_wait_queue { >> + struct wait_queue_entry wq; >> + struct io_ring_ctx *ctx; >> + unsigned to_wait; >> + unsigned nr_timeouts; >> +}; >> + >> +static inline bool io_should_wake(struct io_wait_queue *iowq) >> +{ >> + struct io_ring_ctx *ctx = iowq->ctx; >> + >> + /* >> + * Wake up if we have enough events, or if a timeout occured since we >> + * started waiting. For timeouts, we always want to return to userspace, >> + * regardless of event count. >> + */ >> + return io_cqring_events(ctx->rings) >= iowq->to_wait || >> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; >> +} >> + >> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, >> + int wake_flags, void *key) >> +{ >> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, >> + wq); >> + >> + if (!io_should_wake(iowq)) >> + return -1; > > It would try to schedule only the first task in the wait list. Is that the > semantic you want? > E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns > after @32, and won't wake up the second one. Right, those are the semantics I want. We keep the list ordered by using the exclusive wait addition. Which means that for the case you list, waiters=32 came first, and we should not wake others before that task gets the completions it wants. Otherwise we could potentially starve higher count waiters, if we always keep going and new waiters come in.
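To make those semantics concrete, here is a small userspace toy model (an illustration only, not kernel code) of the FIFO-exclusive wake behaviour, using the waiters=[32,8] case from above:

```c
/* Toy model of FIFO-exclusive threshold wakeups; not kernel code. */
#include <stdio.h>
#include <stdbool.h>

struct waiter {
	unsigned threshold;	/* completions this task is waiting for */
	bool woken;
};

/* Walk waiters in queue order; stop at the first one not yet satisfied. */
static void wake_up_threshold(struct waiter *q, int nr, unsigned events)
{
	for (int i = 0; i < nr; i++) {
		if (q[i].woken)
			continue;	/* models autoremove after a real wakeup */
		if (events < q[i].threshold) {
			printf("  waiter[%d] (threshold %u) keeps the queue blocked\n",
			       i, q[i].threshold);
			return;		/* like io_wake_function() returning -1 */
		}
		q[i].woken = true;
		printf("  waiter[%d] (threshold %u) woken\n", i, q[i].threshold);
		return;			/* exclusive: at most one waiter per wake-up */
	}
}

int main(void)
{
	/* waiter[0] queued first and wants 32 events; waiter[1] wants only 8 */
	struct waiter q[] = { { .threshold = 32 }, { .threshold = 8 } };

	printf("8 events posted:\n");
	wake_up_threshold(q, 2, 8);	/* nobody wakes: the head waiter still needs 32 */

	printf("32 events posted:\n");
	wake_up_threshold(q, 2, 32);	/* head waiter wakes first, preserving FIFO order */
	wake_up_threshold(q, 2, 32);	/* a later wake-up can then reach waiter[1] */
	return 0;
}
```

With only 8 events posted nothing wakes, because the head-of-queue waiter still needs 32; only once its threshold is met (and it has removed itself) can waiters behind it be reached.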
On 24/09/2019 20:46, Jens Axboe wrote: > On 9/24/19 11:33 AM, Pavel Begunkov wrote: >> On 24/09/2019 16:13, Jens Axboe wrote: >>> On 9/24/19 5:23 AM, Pavel Begunkov wrote: >>>>> Yep that should do it, and saves 8 bytes of stack as well. >>>>> >>>>> BTW, did you test my patch, this one or the previous? Just curious if it >>>>> worked for you. >>>>> >>>> Not yet, going to do that tonight >>> >>> Thanks! For reference, the final version is below. There was still a >>> signal mishap in there, now it should all be correct afaict. >>> >>> >>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>> index 9b84232e5cc4..d2a86164d520 100644 >>> --- a/fs/io_uring.c >>> +++ b/fs/io_uring.c >>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, >>> return submit; >>> } >>> >>> +struct io_wait_queue { >>> + struct wait_queue_entry wq; >>> + struct io_ring_ctx *ctx; >>> + unsigned to_wait; >>> + unsigned nr_timeouts; >>> +}; >>> + >>> +static inline bool io_should_wake(struct io_wait_queue *iowq) >>> +{ >>> + struct io_ring_ctx *ctx = iowq->ctx; >>> + >>> + /* >>> + * Wake up if we have enough events, or if a timeout occured since we >>> + * started waiting. For timeouts, we always want to return to userspace, >>> + * regardless of event count. >>> + */ >>> + return io_cqring_events(ctx->rings) >= iowq->to_wait || >>> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; >>> +} >>> + >>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, >>> + int wake_flags, void *key) >>> +{ >>> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, >>> + wq); >>> + >>> + if (!io_should_wake(iowq)) >>> + return -1; >> >> It would try to schedule only the first task in the wait list. Is that the >> semantic you want? >> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns >> after @32, and won't wake up the second one. > > Right, those are the semantics I want. We keep the list ordered by using > the exclusive wait addition. Which means that for the case you list, > waiters=32 came first, and we should not wake others before that task > gets the completions it wants. Otherwise we could potentially starve > higher count waiters, if we always keep going and new waiters come in. > Yes. I think It would better to be documented in userspace API. I could imagine some crazy case deadlocking userspace. E.g. thread 1: wait_events(8), reap_events thread 2: wait_events(32), wait(thread 1), reap_events works well Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Pavel Begunkov <asml.silence@gmail.com> BTW, I searched for wait_event*(), and it seems there are plenty of similar use cases. So, generic case would be useful, but this is for later.
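One way userspace can defend against ordering surprises like this is to bound every high-threshold wait. A sketch using liburing (assumes a liburing and kernel new enough to support completion timeouts, i.e. the cq_timeouts machinery referenced in the patch; reap_ready_cqes() and wait_for_batch() are hypothetical helpers invented here for illustration):

```c
/* Sketch: bound a high-threshold wait so it cannot block behind others forever. */
#include <liburing.h>
#include <errno.h>

/* Hypothetical helper: reap whatever is already in the CQ ring. */
static unsigned reap_ready_cqes(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	unsigned head, seen = 0;

	io_uring_for_each_cqe(ring, head, cqe)
		seen++;
	io_uring_cq_advance(ring, seen);
	return seen;
}

/* Wait for 'want' completions, but never sleep more than a second at a time. */
int wait_for_batch(struct io_uring *ring, unsigned want)
{
	struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
	struct io_uring_cqe *cqe;
	unsigned got = 0;

	while (got < want) {
		int ret = io_uring_wait_cqes(ring, &cqe, want - got, &ts, NULL);

		if (ret < 0 && ret != -ETIME)
			return ret;	/* real error */
		got += reap_ready_cqes(ring);	/* take what arrived, then retry */
	}
	return 0;
}
```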
On 9/24/19 12:28 PM, Pavel Begunkov wrote: > On 24/09/2019 20:46, Jens Axboe wrote: >> On 9/24/19 11:33 AM, Pavel Begunkov wrote: >>> On 24/09/2019 16:13, Jens Axboe wrote: >>>> On 9/24/19 5:23 AM, Pavel Begunkov wrote: >>>>>> Yep that should do it, and saves 8 bytes of stack as well. >>>>>> >>>>>> BTW, did you test my patch, this one or the previous? Just curious if it >>>>>> worked for you. >>>>>> >>>>> Not yet, going to do that tonight >>>> >>>> Thanks! For reference, the final version is below. There was still a >>>> signal mishap in there, now it should all be correct afaict. >>>> >>>> >>>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>>> index 9b84232e5cc4..d2a86164d520 100644 >>>> --- a/fs/io_uring.c >>>> +++ b/fs/io_uring.c >>>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, >>>> return submit; >>>> } >>>> >>>> +struct io_wait_queue { >>>> + struct wait_queue_entry wq; >>>> + struct io_ring_ctx *ctx; >>>> + unsigned to_wait; >>>> + unsigned nr_timeouts; >>>> +}; >>>> + >>>> +static inline bool io_should_wake(struct io_wait_queue *iowq) >>>> +{ >>>> + struct io_ring_ctx *ctx = iowq->ctx; >>>> + >>>> + /* >>>> + * Wake up if we have enough events, or if a timeout occured since we >>>> + * started waiting. For timeouts, we always want to return to userspace, >>>> + * regardless of event count. >>>> + */ >>>> + return io_cqring_events(ctx->rings) >= iowq->to_wait || >>>> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; >>>> +} >>>> + >>>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, >>>> + int wake_flags, void *key) >>>> +{ >>>> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, >>>> + wq); >>>> + >>>> + if (!io_should_wake(iowq)) >>>> + return -1; >>> >>> It would try to schedule only the first task in the wait list. Is that the >>> semantic you want? >>> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns >>> after @32, and won't wake up the second one. >> >> Right, those are the semantics I want. We keep the list ordered by using >> the exclusive wait addition. Which means that for the case you list, >> waiters=32 came first, and we should not wake others before that task >> gets the completions it wants. Otherwise we could potentially starve >> higher count waiters, if we always keep going and new waiters come in. >> > Yes. I think It would better to be documented in userspace API. I > could imagine some crazy case deadlocking userspace. E.g. > thread 1: wait_events(8), reap_events > thread 2: wait_events(32), wait(thread 1), reap_events No matter how you handle cases like this, there will always be deadlocks possible... So I don't think that's a huge concern. It's more important to not have intentional livelocks, which we would have if we always allowed the lowest wait count to get woken and steal the budget everytime. > works well > Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> > Tested-by: Pavel Begunkov <asml.silence@gmail.com> Thanks, will add! > BTW, I searched for wait_event*(), and it seems there are plenty of > similar use cases. So, generic case would be useful, but this is for > later. Agree, it would undoubtedly be useful.
From: Pavel Begunkov <asml.silence@gmail.com>

There could be a lot of overhead within the generic wait_event_*() used for
waiting for a large number of completions. The patchset removes much of it
by using a custom wait event (wait_threshold).

A synthetic test showed a ~40% performance boost. (see patch 2)

v2: rebase

Pavel Begunkov (2):
  sched/wait: Add wait_threshold
  io_uring: Optimise cq waiting with wait_threshold

 fs/io_uring.c                  | 35 ++++++++++++------
 include/linux/wait_threshold.h | 67 ++++++++++++++++++++++++++++++++++
 kernel/sched/Makefile          |  2 +-
 kernel/sched/wait_threshold.c  | 26 +++++++++++++
 4 files changed, 118 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/wait_threshold.h
 create mode 100644 kernel/sched/wait_threshold.c