[v2,0/2] Optimise io_uring completion waiting

Message ID: cover.1569139018.git.asml.silence@gmail.com

Message

Pavel Begunkov Sept. 22, 2019, 8:08 a.m. UTC
From: Pavel Begunkov <asml.silence@gmail.com>

There can be a lot of overhead in the generic wait_event_*() helpers when
waiting for a large number of completions. The patchset removes much of
it by using a custom wait event (wait_threshold).

A synthetic test showed a ~40% performance boost (see patch 2).

v2: rebase

Pavel Begunkov (2):
  sched/wait: Add wait_threshold
  io_uring: Optimise cq waiting with wait_threshold

 fs/io_uring.c                  | 35 ++++++++++++------
 include/linux/wait_threshold.h | 67 ++++++++++++++++++++++++++++++++++
 kernel/sched/Makefile          |  2 +-
 kernel/sched/wait_threshold.c  | 26 +++++++++++++
 4 files changed, 118 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/wait_threshold.h
 create mode 100644 kernel/sched/wait_threshold.c

Comments

Jens Axboe Sept. 22, 2019, 3:51 p.m. UTC | #1
On 9/22/19 2:08 AM, Pavel Begunkov (Silence) wrote:
> From: Pavel Begunkov <asml.silence@gmail.com>
> 
> There could be a lot of overhead within generic wait_event_*() used for
> waiting for large number of completions. The patchset removes much of
> it by using custom wait event (wait_threshold).
> 
> Synthetic test showed ~40% performance boost. (see patch 2)

I'm fine with the io_uring side of things, but to queue this up we
really need Peter or Ingo to sign off on the core wakeup bits...

Peter?
Ingo Molnar Sept. 23, 2019, 8:35 a.m. UTC | #2
* Jens Axboe <axboe@kernel.dk> wrote:

> On 9/22/19 2:08 AM, Pavel Begunkov (Silence) wrote:
> > From: Pavel Begunkov <asml.silence@gmail.com>
> > 
> > There could be a lot of overhead within generic wait_event_*() used for
> > waiting for large number of completions. The patchset removes much of
> > it by using custom wait event (wait_threshold).
> > 
> > Synthetic test showed ~40% performance boost. (see patch 2)
> 
> I'm fine with the io_uring side of things, but to queue this up we
> really need Peter or Ingo to sign off on the core wakeup bits...
> 
> Peter?

I'm not sure an extension is needed for such a special interface. Why not
just put a ->threshold value next to the ctx->wait field and use either
the regular wait_event() APIs with the proper condition, or
wait_event_cmd()-style APIs if you absolutely need something more complex
to happen inside?

Should result in a much lower linecount and no scheduler changes. :-)

Thanks,

	Ingo
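[Editor's note: for readers unfamiliar with the APIs Ingo mentions, here
is a minimal sketch of that shape. The ->cq_threshold field name is an
assumption for illustration, not code from the thread.]

```c
/*
 * Editor's sketch of Ingo's suggestion: keep a threshold next to the
 * waitqueue and let the ordinary wait_event() condition check it.
 * The field name ->cq_threshold is hypothetical.
 */
ctx->cq_threshold = min_events;
ret = wait_event_interruptible(ctx->wait,
		io_cqring_events(ctx->rings) >= ctx->cq_threshold);

/*
 * wait_event_cmd(wq_head, condition, cmd1, cmd2) is the variant that
 * additionally runs cmd1 before each schedule() and cmd2 after it,
 * for when something must happen around sleeping (e.g. dropping and
 * retaking a lock).
 */
```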
Pavel Begunkov Sept. 23, 2019, 4:21 p.m. UTC | #3
Hi, and thanks for the feedback.

It could indeed be done with @cond; that's how it works now. This
patchset, however, only addresses the performance issues of that approach.

The problem with wait_event_*() is that, if we have a counter and are
trying to wake up tasks after each increment, it schedules each waiting
task O(threshold) times just for it to spuriously check @cond and go
back to sleep. All that overhead (memory barriers, register saves/loads,
accounting, etc.) turned out to be enough to slow down the system for
some workloads.
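[Editor's note: a minimal sketch of the pattern being described, to make
the cost concrete; the function name is illustrative, not code from the
thread.]

```c
/* Completion side, called once per completed request: */
static void io_commit_one_cqe(struct io_ring_ctx *ctx)
{
	/* ...publish one more CQE to the ring... */
	wake_up(&ctx->wait);	/* schedules any waiter, every time */
}

/*
 * Waiter side: with plain wait_event_interruptible(), the task is woken
 * on every single completion, re-checks the condition, and goes back to
 * sleep -- roughly min_events futile round trips before it holds:
 */
wait_event_interruptible(ctx->wait,
		io_cqring_events(ctx->rings) >= min_events);
```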

With this specialisation it still traverses the wait list and makes
indirect calls to the checker callback, but the list is supposedly
fairly small, so performance there shouldn't be a problem, at least
for now.

Regarding semantics: it should wake a task when the value passed to
wake_up_threshold() is greater than or equal to the task's threshold,
which is specified individually for each task in wait_threshold_*().

In pseudo code:
```
def wake_up_threshold(n, wait_queue):
	for waiter in wait_queue:
		waiter.wake_up_if(n >= waiter.threshold);
```
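[Editor's note: the same semantics rendered as a C wake callback, as a
sketch only; the names threshold_waiter, threshold_wake_function and
wake_up_threshold here are illustrative, not the patch's actual code.]

```c
#include <linux/sched.h>
#include <linux/wait.h>

struct threshold_waiter {
	struct wait_queue_entry wq;
	unsigned long threshold;
};

/*
 * Wake callback: the waker passes the current counter value as @key;
 * only waiters whose threshold has been reached are actually woken.
 */
static int threshold_wake_function(struct wait_queue_entry *curr,
				   unsigned int mode, int wake_flags,
				   void *key)
{
	struct threshold_waiter *tw =
		container_of(curr, struct threshold_waiter, wq);

	if ((unsigned long)key < tw->threshold)
		return 0;	/* not enough yet; stay queued, keep scanning */

	/* wakes the task and removes the entry from the queue */
	return autoremove_wake_function(curr, mode, wake_flags, key);
}

/* Waker side: __wake_up() forwards @key to every entry's ->func. */
static void wake_up_threshold(struct wait_queue_head *wq_head,
			      unsigned long val)
{
	__wake_up(wq_head, TASK_NORMAL, 0, (void *)val);
}
```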

Any thoughts on how to do it better? Ideas are very welcome.

BTW, this monster is mostly a copy-paste of wait_event_*() and
wait_bit_*(). We could try to extract the common parts of the three,
but that's another topic.


On 23/09/2019 11:35, Ingo Molnar wrote:
> 
> * Jens Axboe <axboe@kernel.dk> wrote:
> 
>> On 9/22/19 2:08 AM, Pavel Begunkov (Silence) wrote:
>>> From: Pavel Begunkov <asml.silence@gmail.com>
>>>
>>> There could be a lot of overhead within generic wait_event_*() used for
>>> waiting for large number of completions. The patchset removes much of
>>> it by using custom wait event (wait_threshold).
>>>
>>> Synthetic test showed ~40% performance boost. (see patch 2)
>>
>> I'm fine with the io_uring side of things, but to queue this up we
>> really need Peter or Ingo to sign off on the core wakeup bits...
>>
>> Peter?
> 
> I'm not sure an extension is needed for such a special interface, why not 
> just put a ->threshold value next to the ctx->wait field and use either 
> the regular wait_event() APIs with the proper condition, or 
> wait_event_cmd() style APIs if you absolutely need something more complex 
> to happen inside?
> 
> Should result in a much lower linecount and no scheduler changes. :-)
> 
> Thanks,
> 
> 	Ingo
>
Pavel Begunkov Sept. 23, 2019, 4:32 p.m. UTC | #4
Sorry, I mixed up the threads.

>>
>> I'm not sure an extension is needed for such a special interface, why not 
>> just put a ->threshold value next to the ctx->wait field and use either 
>> the regular wait_event() APIs with the proper condition, or 
>> wait_event_cmd() style APIs if you absolutely need something more complex 
>> to happen inside?
Ingo,
io_uring works fine without this patch, just using wait_event_*() with a
proper condition, but there are performance issues caused by spurious
wakeups. There's a detailed description in the previous mail.
Am I missing something?
Thanks


>>
>> Should result in a much lower linecount and no scheduler changes. :-)
>>
>> Thanks,
>>
>> 	Ingo
>>
>
Jens Axboe Sept. 23, 2019, 8:48 p.m. UTC | #5
On 9/23/19 10:32 AM, Pavel Begunkov wrote:
> Sorry, mixed the threads.
> 
>>>
>>> I'm not sure an extension is needed for such a special interface, why not
>>> just put a ->threshold value next to the ctx->wait field and use either
>>> the regular wait_event() APIs with the proper condition, or
>>> wait_event_cmd() style APIs if you absolutely need something more complex
>>> to happen inside?
> Ingo,
> io_uring works well without this patch just using wait_event_*() with
> proper condition, but there are performance issues with spurious
> wakeups. Detailed description in the previous mail.
> Am I missing something?

I think we can do the same thing, just wrapping the waitqueue in a
structure with a count in it, on the stack. Got some flight time
coming up later today, let me try and cook up a patch.
Jens Axboe Sept. 23, 2019, 11 p.m. UTC | #6
On 9/23/19 2:48 PM, Jens Axboe wrote:
> On 9/23/19 10:32 AM, Pavel Begunkov wrote:
>> Sorry, mixed the threads.
>>
>>>>
>>>> I'm not sure an extension is needed for such a special interface, why not
>>>> just put a ->threshold value next to the ctx->wait field and use either
>>>> the regular wait_event() APIs with the proper condition, or
>>>> wait_event_cmd() style APIs if you absolutely need something more complex
>>>> to happen inside?
>> Ingo,
>> io_uring works well without this patch just using wait_event_*() with
>> proper condition, but there are performance issues with spurious
>> wakeups. Detailed description in the previous mail.
>> Am I missing something?
> 
> I think we can do the same thing, just wrapping the waitqueue in a
> structure with a count in it, on the stack. Got some flight time
> coming up later today, let me try and cook up a patch.

Totally untested, and sent out 5 min before departure... But something
like this.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca7570aca430..c2f9e1da26dd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,37 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
 	return submit;
 }
 
+struct io_wait_queue {
+	struct wait_queue_entry wq;
+	struct io_ring_ctx *ctx;
+	struct task_struct *task;
+	unsigned to_wait;
+	unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+	struct io_ring_ctx *ctx = iowq->ctx;
+
+	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+			    int wake_flags, void *key)
+{
+	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+							wq);
+
+	if (io_should_wake(iowq)) {
+		list_del_init(&curr->entry);
+		wake_up_process(iowq->task);
+		return 1;
+	}
+
+	return -1;
+}
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2806,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
 static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			  const sigset_t __user *sig, size_t sigsz)
 {
+	struct io_wait_queue iowq = {
+		.wq = {
+			.func	= io_wake_function,
+			.entry	= LIST_HEAD_INIT(iowq.wq.entry),
+		},
+		.task		= current,
+		.ctx		= ctx,
+		.to_wait	= min_events,
+	};
 	struct io_rings *rings = ctx->rings;
-	unsigned nr_timeouts;
 	int ret;
 
 	if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2834,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			return ret;
 	}
 
-	nr_timeouts = atomic_read(&ctx->cq_timeouts);
-	/*
-	 * Return if we have enough events, or if a timeout occured since
-	 * we started waiting. For timeouts, we always want to return to
-	 * userspace.
-	 */
-	ret = wait_event_interruptible(ctx->wait,
-				io_cqring_events(rings) >= min_events ||
-				atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+	prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+	do {
+		if (io_should_wake(&iowq))
+			break;
+		schedule();
+		set_current_state(TASK_INTERRUPTIBLE);
+	} while (1);
+	finish_wait(&ctx->wait, &iowq.wq);
+
 	restore_saved_sigmask_unless(ret == -ERESTARTSYS);
 	if (ret == -ERESTARTSYS)
 		ret = -EINTR;
Pavel Begunkov Sept. 24, 2019, 7:06 a.m. UTC | #7
On 24/09/2019 02:00, Jens Axboe wrote:
>> I think we can do the same thing, just wrapping the waitqueue in a
>> structure with a count in it, on the stack. Got some flight time
>> coming up later today, let me try and cook up a patch.
> 
> Totally untested, and sent out 5 min before departure... But something
> like this.
Hmm, this reminds me of my first version. Basically it's the same thing
but with the macros inlined. I wanted to make it reusable and
self-contained, though.

If you don't think it could be useful in other places, sure, we could do
something like that. Is that the case?

Jens Axboe Sept. 24, 2019, 8:02 a.m. UTC | #8
On 9/24/19 1:06 AM, Pavel Begunkov wrote:
> On 24/09/2019 02:00, Jens Axboe wrote:
>>> I think we can do the same thing, just wrapping the waitqueue in a
>>> structure with a count in it, on the stack. Got some flight time
>>> coming up later today, let me try and cook up a patch.
>>
>> Totally untested, and sent out 5 min before departure... But something
>> like this.
> Hmm, reminds me my first version. Basically that's the same thing but
> with macroses inlined. I wanted to make it reusable and self-contained,
> though.
> 
> If you don't think it could be useful in other places, sure, we could do
> something like that. Is that so?

I totally agree it could be useful in other places. Maybe formalize it
and use it with wake_up_nr() instead of adding a new primitive? I haven't
looked into that, so I may be talking nonsense.

In any case, I did get a chance to test it and it works for me. Here's
the "finished" version, slightly cleaned up and with a comment added
for good measure.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca7570aca430..14fae454cf75 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
 	return submit;
 }
 
+struct io_wait_queue {
+	struct wait_queue_entry wq;
+	struct io_ring_ctx *ctx;
+	struct task_struct *task;
+	unsigned to_wait;
+	unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+	struct io_ring_ctx *ctx = iowq->ctx;
+
+	/*
+	 * Wake up if we have enough events, or if a timeout occured since we
+	 * started waiting. For timeouts, we always want to return to userspace,
+	 * regardless of event count.
+	 */
+	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+			    int wake_flags, void *key)
+{
+	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+							wq);
+
+	if (io_should_wake(iowq)) {
+		list_del_init(&curr->entry);
+		wake_up_process(iowq->task);
+		return 1;
+	}
+
+	return -1;
+}
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
 static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			  const sigset_t __user *sig, size_t sigsz)
 {
+	struct io_wait_queue iowq = {
+		.wq = {
+			.func	= io_wake_function,
+			.entry	= LIST_HEAD_INIT(iowq.wq.entry),
+		},
+		.task		= current,
+		.ctx		= ctx,
+		.to_wait	= min_events,
+	};
 	struct io_rings *rings = ctx->rings;
-	unsigned nr_timeouts;
 	int ret;
 
 	if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2839,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			return ret;
 	}
 
-	nr_timeouts = atomic_read(&ctx->cq_timeouts);
-	/*
-	 * Return if we have enough events, or if a timeout occured since
-	 * we started waiting. For timeouts, we always want to return to
-	 * userspace.
-	 */
-	ret = wait_event_interruptible(ctx->wait,
-				io_cqring_events(rings) >= min_events ||
-				atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+	prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+	do {
+		if (io_should_wake(&iowq))
+			break;
+		schedule();
+		set_current_state(TASK_INTERRUPTIBLE);
+	} while (1);
+	finish_wait(&ctx->wait, &iowq.wq);
+
 	restore_saved_sigmask_unless(ret == -ERESTARTSYS);
 	if (ret == -ERESTARTSYS)
 		ret = -EINTR;
Jens Axboe Sept. 24, 2019, 8:27 a.m. UTC | #9
On 9/24/19 2:02 AM, Jens Axboe wrote:
> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>> structure with a count in it, on the stack. Got some flight time
>>>> coming up later today, let me try and cook up a patch.
>>>
>>> Totally untested, and sent out 5 min before departure... But something
>>> like this.
>> Hmm, reminds me my first version. Basically that's the same thing but
>> with macroses inlined. I wanted to make it reusable and self-contained,
>> though.
>>
>> If you don't think it could be useful in other places, sure, we could do
>> something like that. Is that so?
> 
> I totally agree it could be useful in other places. Maybe formalized and
> used with wake_up_nr() instead of adding a new primitive? Haven't looked
> into that, I may be talking nonsense.
> 
> In any case, I did get a chance to test it and it works for me. Here's
> the "finished" version, slightly cleaned up and with a comment added
> for good measure.

Notes:

This version gets the ordering right, you need exclusive waits to get
fifo ordering on the waitqueue.

Both versions (yours and mine) suffer from the problem of potentially
waking too many. I don't think this is a real issue, as generally we
don't do threaded access to the io_urings. But if you had the following
tasks wait on the cqring:

[min_events = 32], [min_events = 8], [min_events = 8]

and we reach the io_cqring_events() == threshold, we'll wake all three.
I don't see a good solution to this, so I suspect we just live with it
until it's proven to be an issue. Both versions are much better than what
we have now.
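[Editor's note: the behaviour discussed here follows from how the core
wakeup loop treats wake-function return values. Paraphrased from memory
after kernel/sched/wait.c:__wake_up_common(); treat as approximate.]

```c
/* Simplified paraphrase of __wake_up_common(): */
list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
	int ret = curr->func(curr, mode, wake_flags, key);

	if (ret < 0)
		break;	/* e.g. io_wake_function()'s -1: stop the scan */
	if (ret && (curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
		break;	/* woke enough exclusive waiters */
}
```

So a negative return from the first entry's wake function ends the walk
before later waiters are even considered, which is what the per-waiter
threshold discussion in this thread turns on.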
Jens Axboe Sept. 24, 2019, 8:36 a.m. UTC | #10
On 9/24/19 2:27 AM, Jens Axboe wrote:
> On 9/24/19 2:02 AM, Jens Axboe wrote:
>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>> structure with a count in it, on the stack. Got some flight time
>>>>> coming up later today, let me try and cook up a patch.
>>>>
>>>> Totally untested, and sent out 5 min before departure... But something
>>>> like this.
>>> Hmm, reminds me my first version. Basically that's the same thing but
>>> with macroses inlined. I wanted to make it reusable and self-contained,
>>> though.
>>>
>>> If you don't think it could be useful in other places, sure, we could do
>>> something like that. Is that so?
>>
>> I totally agree it could be useful in other places. Maybe formalized and
>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>> into that, I may be talking nonsense.
>>
>> In any case, I did get a chance to test it and it works for me. Here's
>> the "finished" version, slightly cleaned up and with a comment added
>> for good measure.
> 
> Notes:
> 
> This version gets the ordering right, you need exclusive waits to get
> fifo ordering on the waitqueue.
> 
> Both versions (yours and mine) suffer from the problem of potentially
> waking too many. I don't think this is a real issue, as generally we
> don't do threaded access to the io_urings. But if you had the following
> tasks wait on the cqring:
> 
> [min_events = 32], [min_events = 8], [min_events = 8]
> 
> and we reach the io_cqring_events() == threshold, we'll wake all three.
> I don't see a good solution to this, so I suspect we just live with
> until proven an issue. Both versions are much better than what we have
> now.

I forgot an issue around signal handling; the version below adds the
right check for that too.

Curious what your test case was for this?


diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca7570aca430..3fbab5692f14 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
 	return submit;
 }
 
+struct io_wait_queue {
+	struct wait_queue_entry wq;
+	struct io_ring_ctx *ctx;
+	struct task_struct *task;
+	unsigned to_wait;
+	unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+	struct io_ring_ctx *ctx = iowq->ctx;
+
+	/*
+	 * Wake up if we have enough events, or if a timeout occured since we
+	 * started waiting. For timeouts, we always want to return to userspace,
+	 * regardless of event count.
+	 */
+	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+			    int wake_flags, void *key)
+{
+	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+							wq);
+
+	if (io_should_wake(iowq)) {
+		list_del_init(&curr->entry);
+		wake_up_process(iowq->task);
+		return 1;
+	}
+
+	return -1;
+}
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
 static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			  const sigset_t __user *sig, size_t sigsz)
 {
+	struct io_wait_queue iowq = {
+		.wq = {
+			.func	= io_wake_function,
+			.entry	= LIST_HEAD_INIT(iowq.wq.entry),
+		},
+		.task		= current,
+		.ctx		= ctx,
+		.to_wait	= min_events,
+	};
 	struct io_rings *rings = ctx->rings;
-	unsigned nr_timeouts;
 	int ret;
 
 	if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2839,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			return ret;
 	}
 
-	nr_timeouts = atomic_read(&ctx->cq_timeouts);
-	/*
-	 * Return if we have enough events, or if a timeout occured since
-	 * we started waiting. For timeouts, we always want to return to
-	 * userspace.
-	 */
-	ret = wait_event_interruptible(ctx->wait,
-				io_cqring_events(rings) >= min_events ||
-				atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+	prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+	do {
+		if (io_should_wake(&iowq))
+			break;
+		schedule();
+		if (signal_pending(current))
+			break;
+		set_current_state(TASK_INTERRUPTIBLE);
+	} while (1);
+	finish_wait(&ctx->wait, &iowq.wq);
+
 	restore_saved_sigmask_unless(ret == -ERESTARTSYS);
 	if (ret == -ERESTARTSYS)
 		ret = -EINTR;
Pavel Begunkov Sept. 24, 2019, 9:20 a.m. UTC | #11
On 24/09/2019 11:27, Jens Axboe wrote:
> On 9/24/19 2:02 AM, Jens Axboe wrote:
>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>> structure with a count in it, on the stack. Got some flight time
>>>>> coming up later today, let me try and cook up a patch.
>>>>
>>>> Totally untested, and sent out 5 min before departure... But something
>>>> like this.
>>> Hmm, reminds me my first version. Basically that's the same thing but
>>> with macroses inlined. I wanted to make it reusable and self-contained,
>>> though.
>>>
>>> If you don't think it could be useful in other places, sure, we could do
>>> something like that. Is that so?
>>
>> I totally agree it could be useful in other places. Maybe formalized and
>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>> into that, I may be talking nonsense.
>>
>> In any case, I did get a chance to test it and it works for me. Here's
>> the "finished" version, slightly cleaned up and with a comment added
>> for good measure.
> 
> Notes:
> 
> This version gets the ordering right, you need exclusive waits to get
> fifo ordering on the waitqueue.
> 
> Both versions (yours and mine) suffer from the problem of potentially
> waking too many. I don't think this is a real issue, as generally we
> don't do threaded access to the io_urings. But if you had the following
> tasks wait on the cqring:
> 
> [min_events = 32], [min_events = 8], [min_events = 8]
> 
> and we reach the io_cqring_events() == threshold, we'll wake all three.
> I don't see a good solution to this, so I suspect we just live with
> until proven an issue. Both versions are much better than what we have
> now.
> 
If io_cqring_events() == 8, only the last two would be woken up in both
implementations, as to_wait/threshold is specified per waiter. Isn't that so?

I agree with waiting until it's proven an issue; I don't see a good
real-life case for that which couldn't be handled efficiently in userspace.
Pavel Begunkov Sept. 24, 2019, 9:21 a.m. UTC | #12
On 24/09/2019 11:02, Jens Axboe wrote:
> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>> structure with a count in it, on the stack. Got some flight time
>>>> coming up later today, let me try and cook up a patch.
>>>
>>> Totally untested, and sent out 5 min before departure... But something
>>> like this.
>> Hmm, reminds me my first version. Basically that's the same thing but
>> with macroses inlined. I wanted to make it reusable and self-contained,
>> though.
>>
>> If you don't think it could be useful in other places, sure, we could do
>> something like that. Is that so?
> 
> I totally agree it could be useful in other places. Maybe formalized and
> used with wake_up_nr() instead of adding a new primitive? Haven't looked
> into that, I may be talking nonsense.

@nr there is the number of tasks to wake up; AFAIK it doesn't solve the
problem.
Pavel Begunkov Sept. 24, 2019, 9:33 a.m. UTC | #13
On 24/09/2019 11:36, Jens Axboe wrote:
> On 9/24/19 2:27 AM, Jens Axboe wrote:
>> On 9/24/19 2:02 AM, Jens Axboe wrote:
>>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>>> structure with a count in it, on the stack. Got some flight time
>>>>>> coming up later today, let me try and cook up a patch.
>>>>>
>>>>> Totally untested, and sent out 5 min before departure... But something
>>>>> like this.
>>>> Hmm, reminds me my first version. Basically that's the same thing but
>>>> with macroses inlined. I wanted to make it reusable and self-contained,
>>>> though.
>>>>
>>>> If you don't think it could be useful in other places, sure, we could do
>>>> something like that. Is that so?
>>>
>>> I totally agree it could be useful in other places. Maybe formalized and
>>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>>> into that, I may be talking nonsense.
>>>
>>> In any case, I did get a chance to test it and it works for me. Here's
>>> the "finished" version, slightly cleaned up and with a comment added
>>> for good measure.
>>
>> Notes:
>>
>> This version gets the ordering right, you need exclusive waits to get
>> fifo ordering on the waitqueue.
>>
>> Both versions (yours and mine) suffer from the problem of potentially
>> waking too many. I don't think this is a real issue, as generally we
>> don't do threaded access to the io_urings. But if you had the following
>> tasks wait on the cqring:
>>
>> [min_events = 32], [min_events = 8], [min_events = 8]
>>
>> and we reach the io_cqring_events() == threshold, we'll wake all three.
>> I don't see a good solution to this, so I suspect we just live with
>> until proven an issue. Both versions are much better than what we have
>> now.
> 
> Forgot an issue around signal handling, version below adds the
> right check for that too.

That seems like a good reason not to keep reimplementing the
"prepare_to_wait*() + wait loop" pattern every time, but to keep it in
sched :)

> 
> Curious what your test case was for this?
You mean a performance test case? It's briefly described in a comment in
the second patch. It's just a rewritten io_uring-bench, with
1. one thread generating 1 request per call in a loop, and
2. a second thread waiting for ~128 events.
Both are pinned to the same core.
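[Editor's note: a hypothetical sketch of the benchmark shape Pavel
describes, written against liburing for brevity; the real test is a
modified io_uring-bench, and the CPU pinning is omitted here.]

```c
#include <liburing.h>
#include <pthread.h>

static struct io_uring ring;

/* Thread 1: submit one nop request per io_uring_enter() call. */
static void *submitter(void *arg)
{
	for (;;) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (!sqe)
			continue;	/* SQ full, retry */
		io_uring_prep_nop(sqe);
		io_uring_submit(&ring);
	}
	return NULL;
}

/* Thread 2: block until a large batch (~128) of completions arrives. */
static void *waiter(void *arg)
{
	struct io_uring_cqe *cqe;

	for (;;) {
		io_uring_wait_cqe_nr(&ring, &cqe, 128);
		io_uring_cq_advance(&ring, io_uring_cq_ready(&ring));
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	io_uring_queue_init(256, &ring, 0);
	pthread_create(&t1, NULL, submitter, NULL);
	pthread_create(&t2, NULL, waiter, NULL);
	pthread_join(t1, NULL);	/* runs until interrupted */
	return 0;
}
```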

Peter Zijlstra Sept. 24, 2019, 9:49 a.m. UTC | #14
On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:

> +struct io_wait_queue {
> +	struct wait_queue_entry wq;
> +	struct io_ring_ctx *ctx;
> +	struct task_struct *task;

wq.private is where the normal waitqueue stores the task pointer.

(I'm going to rename that)

> +	unsigned to_wait;
> +	unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> +	struct io_ring_ctx *ctx = iowq->ctx;
> +
> +	/*
> +	 * Wake up if we have enough events, or if a timeout occured since we
> +	 * started waiting. For timeouts, we always want to return to userspace,
> +	 * regardless of event count.
> +	 */
> +	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> +			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> +			    int wake_flags, void *key)
> +{
> +	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> +							wq);
> +
> +	if (io_should_wake(iowq)) {
> +		list_del_init(&curr->entry);
> +		wake_up_process(iowq->task);

Then you can use autoremove_wake_function() here.

> +		return 1;
> +	}
> +
> +	return -1;
> +}

Ideally we'd get wait_event()'s @cond in a custom wake function. Then we
can _always_ do this.

This is one I'd love to have lambda functions for. It would actually
work with GCC nested functions, because the wake function will always be
in scope, but we can't use those in the kernel for other reasons :/
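[Editor's note: for reference, autoremove_wake_function() in
kernel/sched/wait.c is essentially the following; quoted from memory,
so treat as approximate.]

```c
int autoremove_wake_function(struct wait_queue_entry *wq_entry,
			     unsigned int mode, int sync, void *key)
{
	int ret = default_wake_function(wq_entry, mode, sync, key);

	/* On a successful wakeup, take the entry off the waitqueue so
	 * the woken task needs no dequeue in the common case. */
	if (ret)
		list_del_init(&wq_entry->entry);
	return ret;
}
```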
Jens Axboe Sept. 24, 2019, 10:09 a.m. UTC | #15
On 9/24/19 3:20 AM, Pavel Begunkov wrote:
> 
> 
> On 24/09/2019 11:27, Jens Axboe wrote:
>> On 9/24/19 2:02 AM, Jens Axboe wrote:
>>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>>> structure with a count in it, on the stack. Got some flight time
>>>>>> coming up later today, let me try and cook up a patch.
>>>>>
>>>>> Totally untested, and sent out 5 min before departure... But something
>>>>> like this.
>>>> Hmm, reminds me my first version. Basically that's the same thing but
>>>> with macroses inlined. I wanted to make it reusable and self-contained,
>>>> though.
>>>>
>>>> If you don't think it could be useful in other places, sure, we could do
>>>> something like that. Is that so?
>>>
>>> I totally agree it could be useful in other places. Maybe formalized and
>>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>>> into that, I may be talking nonsense.
>>>
>>> In any case, I did get a chance to test it and it works for me. Here's
>>> the "finished" version, slightly cleaned up and with a comment added
>>> for good measure.
>>
>> Notes:
>>
>> This version gets the ordering right, you need exclusive waits to get
>> fifo ordering on the waitqueue.
>>
>> Both versions (yours and mine) suffer from the problem of potentially
>> waking too many. I don't think this is a real issue, as generally we
>> don't do threaded access to the io_urings. But if you had the following
>> tasks wait on the cqring:
>>
>> [min_events = 32], [min_events = 8], [min_events = 8]
>>
>> and we reach the io_cqring_events() == threshold, we'll wake all three.
>> I don't see a good solution to this, so I suspect we just live with
>> until proven an issue. Both versions are much better than what we have
>> now.
>>
> If io_cqring_events() == 8, only the last two would be woken up in both
> implementations, as to_wait/threshold is specified per waiter. Isn't it?

If io_cqring_events() == 8, then none would be woken in my
implementation since the first one will break the wakeup loop.

> Agree with waiting, I don't see a good real-life case for that, that
> couldn't be done efficiently in userspace.

Exactly
Jens Axboe Sept. 24, 2019, 10:09 a.m. UTC | #16
On 9/24/19 3:21 AM, Pavel Begunkov wrote:
> On 24/09/2019 11:02, Jens Axboe wrote:
>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>> structure with a count in it, on the stack. Got some flight time
>>>>> coming up later today, let me try and cook up a patch.
>>>>
>>>> Totally untested, and sent out 5 min before departure... But something
>>>> like this.
>>> Hmm, reminds me my first version. Basically that's the same thing but
>>> with macroses inlined. I wanted to make it reusable and self-contained,
>>> though.
>>>
>>> If you don't think it could be useful in other places, sure, we could do
>>> something like that. Is that so?
>>
>> I totally agree it could be useful in other places. Maybe formalized and
>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>> into that, I may be talking nonsense.
> 
> @nr there is about number of tasks to wake up. AFAIK doesn't solve the
> problem.

Ah right, embarrassingly I'm actually the one who added that
functionality ages ago...
Jens Axboe Sept. 24, 2019, 10:11 a.m. UTC | #17
On 9/24/19 3:33 AM, Pavel Begunkov wrote:
> 
> 
> On 24/09/2019 11:36, Jens Axboe wrote:
>> On 9/24/19 2:27 AM, Jens Axboe wrote:
>>> On 9/24/19 2:02 AM, Jens Axboe wrote:
>>>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>>>> structure with a count in it, on the stack. Got some flight time
>>>>>>> coming up later today, let me try and cook up a patch.
>>>>>>
>>>>>> Totally untested, and sent out 5 min before departure... But something
>>>>>> like this.
>>>>> Hmm, reminds me my first version. Basically that's the same thing but
>>>>> with macroses inlined. I wanted to make it reusable and self-contained,
>>>>> though.
>>>>>
>>>>> If you don't think it could be useful in other places, sure, we could do
>>>>> something like that. Is that so?
>>>>
>>>> I totally agree it could be useful in other places. Maybe formalized and
>>>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>>>> into that, I may be talking nonsense.
>>>>
>>>> In any case, I did get a chance to test it and it works for me. Here's
>>>> the "finished" version, slightly cleaned up and with a comment added
>>>> for good measure.
>>>
>>> Notes:
>>>
>>> This version gets the ordering right, you need exclusive waits to get
>>> fifo ordering on the waitqueue.
>>>
>>> Both versions (yours and mine) suffer from the problem of potentially
>>> waking too many. I don't think this is a real issue, as generally we
>>> don't do threaded access to the io_urings. But if you had the following
>>> tasks wait on the cqring:
>>>
>>> [min_events = 32], [min_events = 8], [min_events = 8]
>>>
>>> and we reach the io_cqring_events() == threshold, we'll wake all three.
>>> I don't see a good solution to this, so I suspect we just live with
>>> until proven an issue. Both versions are much better than what we have
>>> now.
>>
>> Forgot an issue around signal handling, version below adds the
>> right check for that too.
> 
> It seems to be a good reason to not keep reimplementing
> "prepare_to_wait*() + wait loop" every time, but keep it in sched :)

I think if we do the ->private cleanup that Peter mentioned, then
there's not much left in terms of consolidation. I'm not convinced the
case is interesting enough to warrant a special helper. If others show
up, it's easy enough to consolidate the use cases and unify them.

If you look at wake_up_nr(), I would have thought it would be more
widespread. But it really isn't.

>> Curious what your test case was for this?
> You mean a performance test case? It's briefly described in a comment
> for the second patch. That's just rewritten io_uring-bench, with
> 1. a thread generating 1 request per call in a loop
> 2. and the second thread waiting for ~128 events.
> Both are pinned to the same core.

Gotcha, thanks.
Jens Axboe Sept. 24, 2019, 10:13 a.m. UTC | #18
On 9/24/19 3:49 AM, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
> 
>> +struct io_wait_queue {
>> +	struct wait_queue_entry wq;
>> +	struct io_ring_ctx *ctx;
>> +	struct task_struct *task;
> 
> wq.private is where the normal waitqueue stores the task pointer.
> 
> (I'm going to rename that)

If you do that, then we can just base the io_uring parts on that. 

>> +	unsigned to_wait;
>> +	unsigned nr_timeouts;
>> +};
>> +
>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>> +{
>> +	struct io_ring_ctx *ctx = iowq->ctx;
>> +
>> +	/*
>> +	 * Wake up if we have enough events, or if a timeout occured since we
>> +	 * started waiting. For timeouts, we always want to return to userspace,
>> +	 * regardless of event count.
>> +	 */
>> +	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>> +			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>> +}
>> +
>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>> +			    int wake_flags, void *key)
>> +{
>> +	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>> +							wq);
>> +
>> +	if (io_should_wake(iowq)) {
>> +		list_del_init(&curr->entry);
>> +		wake_up_process(iowq->task);
> 
> Then you can use autoremove_wake_function() here.
> 
>> +		return 1;
>> +	}
>> +
>> +	return -1;
>> +}
> 
> Ideally we'd get wait_event()'s @cond in a custom wake function. Then we
> can _always_ do this.
> 
> This is one I'd love to have lambda functions for. It would actually
> work with GCC nested functions, because the wake function will always be
> in scope, but we can't use those in the kernel for other reasons :/

I'll be happy enough if I can just call autoremove_wake_function(); I
think that will simplify the case enough for io_uring that I won't care
too much about going further. I'll leave that to you, if you have the
desire :-)
Jens Axboe Sept. 24, 2019, 10:34 a.m. UTC | #19
On 9/24/19 4:13 AM, Jens Axboe wrote:
> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>
>>> +struct io_wait_queue {
>>> +	struct wait_queue_entry wq;
>>> +	struct io_ring_ctx *ctx;
>>> +	struct task_struct *task;
>>
>> wq.private is where the normal waitqueue stores the task pointer.
>>
>> (I'm going to rename that)
> 
> If you do that, then we can just base the io_uring parts on that.

I just took a quick look at it, and ran into block/kyber-iosched.c,
which actually uses the private pointer for something that isn't a task
struct...
Pavel Begunkov Sept. 24, 2019, 11:11 a.m. UTC | #20
On 24/09/2019 13:34, Jens Axboe wrote:
> On 9/24/19 4:13 AM, Jens Axboe wrote:
>> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>>
>>>> +struct io_wait_queue {
>>>> +	struct wait_queue_entry wq;
>>>> +	struct io_ring_ctx *ctx;
>>>> +	struct task_struct *task;
>>>
>>> wq.private is where the normal waitqueue stores the task pointer.
>>>
>>> (I'm going to rename that)
>>
>> If you do that, then we can just base the io_uring parts on that.
> 
> Just took a quick look at it, and ran into block/kyber-iosched.c that
> actually uses the private pointer for something that isn't a task
> struct...
> 

Let's reuse autoremove_wake_function() anyway. I changed your patch a bit:


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5c3f2bb81637..a77971290fdd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2690,6 +2690,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
 	return submit;
 }
 
+struct io_wait_queue {
+	struct wait_queue_entry wq;
+	struct io_ring_ctx *ctx;
+	unsigned to_wait;
+	unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+	struct io_ring_ctx *ctx = iowq->ctx;
+
+	/*
+	 * Wake up if we have enough events, or if a timeout occured since we
+	 * started waiting. For timeouts, we always want to return to userspace,
+	 * regardless of event count.
+	 */
+	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+			    int wake_flags, void *key)
+{
+	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+							wq);
+
+	if (!io_should_wake(iowq))
+		return -1;
+
+	return autoremove_wake_function(curr, mode, wake_flags, key);
+}
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -2697,8 +2729,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
 static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			  const sigset_t __user *sig, size_t sigsz)
 {
+	struct io_wait_queue iowq = {
+		.wq = {
+			.private = current,
+			.func	= io_wake_function,
+			.entry	= LIST_HEAD_INIT(iowq.wq.entry),
+		},
+		.ctx		= ctx,
+		.to_wait	= min_events,
+	};
 	struct io_rings *rings = ctx->rings;
-	unsigned nr_timeouts;
 	int ret;
 
 	if (io_cqring_events(rings) >= min_events)
@@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			return ret;
 	}
 
-	nr_timeouts = atomic_read(&ctx->cq_timeouts);
-	/*
-	 * Return if we have enough events, or if a timeout occured since
-	 * we started waiting. For timeouts, we always want to return to
-	 * userspace.
-	 */
-	ret = wait_event_interruptible(ctx->wait,
-				io_cqring_events(rings) >= min_events ||
-				atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+	prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+	do {
+		if (io_should_wake(&iowq))
+			break;
+		schedule();
+		if (signal_pending(current))
+			break;
+		set_current_state(TASK_INTERRUPTIBLE);
+	} while (1);
+	finish_wait(&ctx->wait, &iowq.wq);
+
 	restore_saved_sigmask_unless(ret == -ERESTARTSYS);
 	if (ret == -ERESTARTSYS)
 		ret = -EINTR;
Jens Axboe Sept. 24, 2019, 11:15 a.m. UTC | #21
On 9/24/19 5:11 AM, Pavel Begunkov wrote:
> On 24/09/2019 13:34, Jens Axboe wrote:
>> On 9/24/19 4:13 AM, Jens Axboe wrote:
>>> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>>>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>>>
>>>>> +struct io_wait_queue {
>>>>> +	struct wait_queue_entry wq;
>>>>> +	struct io_ring_ctx *ctx;
>>>>> +	struct task_struct *task;
>>>>
>>>> wq.private is where the normal waitqueue stores the task pointer.
>>>>
>>>> (I'm going to rename that)
>>>
>>> If you do that, then we can just base the io_uring parts on that.
>>
>> Just took a quick look at it, and ran into block/kyber-iosched.c that
>> actually uses the private pointer for something that isn't a task
>> struct...
>>
> 
> Let reuse autoremove_wake_function anyway. Changed a bit your patch:

Yep that should do it, and saves 8 bytes of stack as well.

BTW, did you test my patch, this one or the previous? Just curious if it
worked for you.
Pavel Begunkov Sept. 24, 2019, 11:23 a.m. UTC | #22
On 24/09/2019 14:15, Jens Axboe wrote:
> On 9/24/19 5:11 AM, Pavel Begunkov wrote:
>> On 24/09/2019 13:34, Jens Axboe wrote:
>>> On 9/24/19 4:13 AM, Jens Axboe wrote:
>>>> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>>>>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>>>>
>>>>>> +struct io_wait_queue {
>>>>>> +	struct wait_queue_entry wq;
>>>>>> +	struct io_ring_ctx *ctx;
>>>>>> +	struct task_struct *task;
>>>>>
>>>>> wq.private is where the normal waitqueue stores the task pointer.
>>>>>
>>>>> (I'm going to rename that)
>>>>
>>>> If you do that, then we can just base the io_uring parts on that.
>>>
>>> Just took a quick look at it, and ran into block/kyber-iosched.c that
>>> actually uses the private pointer for something that isn't a task
>>> struct...
>>>
>>
>> Let reuse autoremove_wake_function anyway. Changed a bit your patch:
> 
> Yep that should do it, and saves 8 bytes of stack as well.
> 
> BTW, did you test my patch, this one or the previous? Just curious if it
> worked for you.
> 
Not yet, going to do that tonight
Peter Zijlstra Sept. 24, 2019, 11:33 a.m. UTC | #23
On Tue, Sep 24, 2019 at 12:34:17PM +0200, Jens Axboe wrote:

> Just took a quick look at it, and ran into block/kyber-iosched.c that
> actually uses the private pointer for something that isn't a task
> struct...

Argh... that's some 'creative' abuse of the waitqueue API :/
Peter Zijlstra Sept. 24, 2019, 11:43 a.m. UTC | #24
On Tue, Sep 24, 2019 at 02:11:29PM +0300, Pavel Begunkov wrote:

> @@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>  			return ret;
>  	}
>  
> +	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> +	prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
> +	do {
> +		if (io_should_wake(&iowq))
> +			break;
> +		schedule();
> +		if (signal_pending(current))
> +			break;
> +		set_current_state(TASK_INTERRUPTIBLE);
> +	} while (1);
> +	finish_wait(&ctx->wait, &iowq.wq);

It is likely OK, but out of paranoia I'd prefer this form:

	for (;;) {
		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
		if (io_should_wake(&iowq))
			break;
		schedule();
		if (signal_pending(current))
			break;
	}
	finish_wait(&ctx->wait, &iowq.wq);

The thing is, if we ever succeed with io_wake_function() (that CPU
observes io_should_wake()), but when waking here we do not observe
io_should_wake() and go to sleep again, we might never wake up if we
don't put ourselves back on the wait-list again.
Jens Axboe Sept. 24, 2019, 12:57 p.m. UTC | #25
On 9/24/19 5:43 AM, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 02:11:29PM +0300, Pavel Begunkov wrote:
> 
>> @@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>>   			return ret;
>>   	}
>>   
>> +	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
>> +	prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
>> +	do {
>> +		if (io_should_wake(&iowq))
>> +			break;
>> +		schedule();
>> +		if (signal_pending(current))
>> +			break;
>> +		set_current_state(TASK_INTERRUPTIBLE);
>> +	} while (1);
>> +	finish_wait(&ctx->wait, &iowq.wq);
> 
> It it likely OK, but for paranoia, I'd prefer this form:
> 
> 	for (;;) {
> 		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
> 		if (io_should_wake(&iowq))
> 			break;
> 		schedule();
> 		if (signal_pending(current))
> 			break;
> 	}
> 	finish_wait(&ctx->wait, &iowq.wq);
> 
> The thing is, if we ever succeed with io_wake_function() (that CPU
> observes io_should_wake()), but when waking here, we do not observe
> is_wake_function() and go sleep again, we might never wake up if we
> don't put ourselves back on the wait-list again.

It might be paranoia indeed, but especially after this change, we don't
expect to make frequent roundtrips there. So better safe than sorry;
I'll make the change.
Jens Axboe Sept. 24, 2019, 1:13 p.m. UTC | #26
On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>> Yep that should do it, and saves 8 bytes of stack as well.
>>
>> BTW, did you test my patch, this one or the previous? Just curious if it
>> worked for you.
>>
> Not yet, going to do that tonight

Thanks! For reference, the final version is below. There was still a
signal mishap in there; now it should all be correct, afaict.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9b84232e5cc4..d2a86164d520 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
 	return submit;
 }
 
+struct io_wait_queue {
+	struct wait_queue_entry wq;
+	struct io_ring_ctx *ctx;
+	unsigned to_wait;
+	unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+	struct io_ring_ctx *ctx = iowq->ctx;
+
+	/*
+	 * Wake up if we have enough events, or if a timeout occured since we
+	 * started waiting. For timeouts, we always want to return to userspace,
+	 * regardless of event count.
+	 */
+	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+			    int wake_flags, void *key)
+{
+	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+							wq);
+
+	if (!io_should_wake(iowq))
+		return -1;
+
+	return autoremove_wake_function(curr, mode, wake_flags, key);
+}
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2807,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
 static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			  const sigset_t __user *sig, size_t sigsz)
 {
+	struct io_wait_queue iowq = {
+		.wq = {
+			.private	= current,
+			.func		= io_wake_function,
+			.entry		= LIST_HEAD_INIT(iowq.wq.entry),
+		},
+		.ctx		= ctx,
+		.to_wait	= min_events,
+	};
 	struct io_rings *rings = ctx->rings;
-	unsigned nr_timeouts;
 	int ret;
 
 	if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2835,20 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			return ret;
 	}
 
-	nr_timeouts = atomic_read(&ctx->cq_timeouts);
-	/*
-	 * Return if we have enough events, or if a timeout occured since
-	 * we started waiting. For timeouts, we always want to return to
-	 * userspace.
-	 */
-	ret = wait_event_interruptible(ctx->wait,
-				io_cqring_events(rings) >= min_events ||
-				atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+	do {
+		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
+						TASK_INTERRUPTIBLE);
+		if (io_should_wake(&iowq))
+			break;
+		schedule();
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+	} while (1);
+	finish_wait(&ctx->wait, &iowq.wq);
+
 	restore_saved_sigmask_unless(ret == -ERESTARTSYS);
 	if (ret == -ERESTARTSYS)
 		ret = -EINTR;
Pavel Begunkov Sept. 24, 2019, 5:33 p.m. UTC | #27
On 24/09/2019 16:13, Jens Axboe wrote:
> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>
>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>> worked for you.
>>>
>> Not yet, going to do that tonight
> 
> Thanks! For reference, the final version is below. There was still a
> signal mishap in there, now it should all be correct afaict.
> 
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 9b84232e5cc4..d2a86164d520 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>  	return submit;
>  }
>  
> +struct io_wait_queue {
> +	struct wait_queue_entry wq;
> +	struct io_ring_ctx *ctx;
> +	unsigned to_wait;
> +	unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> +	struct io_ring_ctx *ctx = iowq->ctx;
> +
> +	/*
> +	 * Wake up if we have enough events, or if a timeout occurred since we
> +	 * started waiting. For timeouts, we always want to return to userspace,
> +	 * regardless of event count.
> +	 */
> +	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> +			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> +			    int wake_flags, void *key)
> +{
> +	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> +							wq);
> +
> +	if (!io_should_wake(iowq))
> +		return -1;

It would wake up only the first task in the wait list. Are those the
semantics you want?
E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
-1 at the 32-event waiter, so the walk stops and the 8-event waiter is
never woken.

> +
> +	return autoremove_wake_function(curr, mode, wake_flags, key);
> +}
> +
>  /*
>   * Wait until events become available, if we don't already have some. The
>   * application must reap them itself, as they reside on the shared cq ring.
> @@ -2775,8 +2807,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>  static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>  			  const sigset_t __user *sig, size_t sigsz)
>  {
> +	struct io_wait_queue iowq = {
> +		.wq = {
> +			.private	= current,
> +			.func		= io_wake_function,
> +			.entry		= LIST_HEAD_INIT(iowq.wq.entry),
> +		},
> +		.ctx		= ctx,
> +		.to_wait	= min_events,
> +	};
>  	struct io_rings *rings = ctx->rings;
> -	unsigned nr_timeouts;
>  	int ret;
>  
>  	if (io_cqring_events(rings) >= min_events)
> @@ -2795,15 +2835,20 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>  			return ret;
>  	}
>  
> -	nr_timeouts = atomic_read(&ctx->cq_timeouts);
> -	/*
> -	 * Return if we have enough events, or if a timeout occurred since
> -	 * we started waiting. For timeouts, we always want to return to
> -	 * userspace.
> -	 */
> -	ret = wait_event_interruptible(ctx->wait,
> -				io_cqring_events(rings) >= min_events ||
> -				atomic_read(&ctx->cq_timeouts) != nr_timeouts);
> +	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> +	do {
> +		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
> +						TASK_INTERRUPTIBLE);
> +		if (io_should_wake(&iowq))
> +			break;
> +		schedule();
> +		if (signal_pending(current)) {
> +			ret = -ERESTARTSYS;
> +			break;
> +		}
> +	} while (1);
> +	finish_wait(&ctx->wait, &iowq.wq);
> +
>  	restore_saved_sigmask_unless(ret == -ERESTARTSYS);
>  	if (ret == -ERESTARTSYS)
>  		ret = -EINTR;
>
Jens Axboe Sept. 24, 2019, 5:46 p.m. UTC | #28
On 9/24/19 11:33 AM, Pavel Begunkov wrote:
> On 24/09/2019 16:13, Jens Axboe wrote:
>> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>>
>>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>>> worked for you.
>>>>
>>> Not yet, going to do that tonight
>>
>> Thanks! For reference, the final version is below. There was still a
>> signal mishap in there, now it should all be correct afaict.
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 9b84232e5cc4..d2a86164d520 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>>   	return submit;
>>   }
>>   
>> +struct io_wait_queue {
>> +	struct wait_queue_entry wq;
>> +	struct io_ring_ctx *ctx;
>> +	unsigned to_wait;
>> +	unsigned nr_timeouts;
>> +};
>> +
>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>> +{
>> +	struct io_ring_ctx *ctx = iowq->ctx;
>> +
>> +	/*
>> +	 * Wake up if we have enough events, or if a timeout occurred since we
>> +	 * started waiting. For timeouts, we always want to return to userspace,
>> +	 * regardless of event count.
>> +	 */
>> +	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>> +			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>> +}
>> +
>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>> +			    int wake_flags, void *key)
>> +{
>> +	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>> +							wq);
>> +
>> +	if (!io_should_wake(iowq))
>> +		return -1;
> 
> It would wake up only the first task in the wait list. Are those the
> semantics you want?
> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
> -1 at the 32-event waiter, so the walk stops and the 8-event waiter is
> never woken.

Right, those are the semantics I want. We keep the list ordered by using
the exclusive wait addition. Which means that for the case you list,
waiters=32 came first, and we should not wake others before that task
gets the completions it wants. Otherwise we could potentially starve
higher count waiters, if we always keep going and new waiters come in.
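
For context, that behaviour falls out of how the wait-queue walk
treats a negative return from the wake function. Roughly (a
simplified sketch of __wake_up_common() in kernel/sched/wait.c,
bookmark handling elided):

	list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
		unsigned flags = curr->flags;
		int ret = curr->func(curr, mode, wake_flags, key);

		/*
		 * io_wake_function() returns -1 while the head
		 * waiter's threshold is unmet; a negative return
		 * aborts the whole walk, so no waiter queued behind
		 * it is even looked at.
		 */
		if (ret < 0)
			break;
		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}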
Pavel Begunkov Sept. 24, 2019, 6:28 p.m. UTC | #29
On 24/09/2019 20:46, Jens Axboe wrote:
> On 9/24/19 11:33 AM, Pavel Begunkov wrote:
>> On 24/09/2019 16:13, Jens Axboe wrote:
>>> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>>>
>>>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>>>> worked for you.
>>>>>
>>>> Not yet, going to do that tonight
>>>
>>> Thanks! For reference, the final version is below. There was still a
>>> signal mishap in there, now it should all be correct afaict.
>>>
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 9b84232e5cc4..d2a86164d520 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>>>   	return submit;
>>>   }
>>>   
>>> +struct io_wait_queue {
>>> +	struct wait_queue_entry wq;
>>> +	struct io_ring_ctx *ctx;
>>> +	unsigned to_wait;
>>> +	unsigned nr_timeouts;
>>> +};
>>> +
>>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>>> +{
>>> +	struct io_ring_ctx *ctx = iowq->ctx;
>>> +
>>> +	/*
>>> +	 * Wake up if we have enough events, or if a timeout occurred since we
>>> +	 * started waiting. For timeouts, we always want to return to userspace,
>>> +	 * regardless of event count.
>>> +	 */
>>> +	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>>> +			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>>> +}
>>> +
>>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>>> +			    int wake_flags, void *key)
>>> +{
>>> +	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>>> +							wq);
>>> +
>>> +	if (!io_should_wake(iowq))
>>> +		return -1;
>>
>> It would wake up only the first task in the wait list. Are those the
>> semantics you want?
>> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
>> -1 at the 32-event waiter, so the walk stops and the 8-event waiter is
>> never woken.
> 
> Right, those are the semantics I want. We keep the list ordered by using
> the exclusive wait addition. Which means that for the case you list,
> waiters=32 came first, and we should not wake others before that task
> gets the completions it wants. Otherwise we could potentially starve
> higher count waiters, if we always keep going and new waiters come in.
> 
Yes. I think it would be better to document this in the userspace API.
I could imagine some crazy case deadlocking userspace, e.g.:
thread 1: wait_events(8), reap_events
thread 2: wait_events(32), wait(thread 1), reap_events
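
A sketch of that scenario in userspace (raw io_uring_enter(2); the
wrapper and setup are illustrative, and it assumes both threads block
before any of the 8 completions arrive, with thread 2 queued first):

	#include <pthread.h>
	#include <signal.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#define IORING_ENTER_GETEVENTS	(1U << 0)	/* from the uapi header */

	static int ring_fd;	/* an initialised io_uring fd, setup elided */

	/* Illustrative wrapper around the raw syscall:
	 * (fd, to_submit, min_complete, flags, sig, sigsz). */
	static int ring_wait(unsigned int min_complete)
	{
		return syscall(__NR_io_uring_enter, ring_fd, 0, min_complete,
			       IORING_ENTER_GETEVENTS, NULL, _NSIG / 8);
	}

	static void *thread1_fn(void *arg)
	{
		/*
		 * Queued behind thread 2's 32-event entry; with only 8
		 * completions posted, the wake walk stops at that entry
		 * and this call never returns.
		 */
		ring_wait(8);
		/* ... reap 8 CQEs ... */
		return NULL;
	}

	static void *thread2_fn(void *arg)
	{
		pthread_t *t1 = arg;

		/*
		 * Heads the exclusive wait queue, so the wake walk
		 * stops here while fewer than 32 events are posted;
		 * thread 1 behind us is never woken, and we in turn
		 * depend on thread 1 to make progress: deadlock.
		 */
		ring_wait(32);
		pthread_join(*t1, NULL);
		/* ... reap ... */
		return NULL;
	}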

works well
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>

BTW, I searched for wait_event*(), and it seems there are plenty of
similar use cases. So a generic version would be useful, but that's
for later.
Jens Axboe Sept. 24, 2019, 7:32 p.m. UTC | #30
On 9/24/19 12:28 PM, Pavel Begunkov wrote:
> On 24/09/2019 20:46, Jens Axboe wrote:
>> On 9/24/19 11:33 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 16:13, Jens Axboe wrote:
>>>> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>>>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>>>>
>>>>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>>>>> worked for you.
>>>>>>
>>>>> Not yet, going to do that tonight
>>>>
>>>> Thanks! For reference, the final version is below. There was still a
>>>> signal mishap in there, now it should all be correct afaict.
>>>>
>>>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 9b84232e5cc4..d2a86164d520 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>>>>    	return submit;
>>>>    }
>>>>    
>>>> +struct io_wait_queue {
>>>> +	struct wait_queue_entry wq;
>>>> +	struct io_ring_ctx *ctx;
>>>> +	unsigned to_wait;
>>>> +	unsigned nr_timeouts;
>>>> +};
>>>> +
>>>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>>>> +{
>>>> +	struct io_ring_ctx *ctx = iowq->ctx;
>>>> +
>>>> +	/*
>>>> +	 * Wake up if we have enough events, or if a timeout occurred since we
>>>> +	 * started waiting. For timeouts, we always want to return to userspace,
>>>> +	 * regardless of event count.
>>>> +	 */
>>>> +	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>>>> +			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>>>> +}
>>>> +
>>>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>>>> +			    int wake_flags, void *key)
>>>> +{
>>>> +	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>>>> +							wq);
>>>> +
>>>> +	if (!io_should_wake(iowq))
>>>> +		return -1;
>>>
>>> It would wake up only the first task in the wait list. Are those the
>>> semantics you want?
>>> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
>>> -1 at the 32-event waiter, so the walk stops and the 8-event waiter is
>>> never woken.
>>
>> Right, those are the semantics I want. We keep the list ordered by using
>> the exclusive wait addition. Which means that for the case you list,
>> waiters=32 came first, and we should not wake others before that task
>> gets the completions it wants. Otherwise we could potentially starve
>> higher count waiters, if we always keep going and new waiters come in.
>>
> Yes. I think it would be better to document this in the userspace API.
> I could imagine some crazy case deadlocking userspace, e.g.:
> thread 1: wait_events(8), reap_events
> thread 2: wait_events(32), wait(thread 1), reap_events

No matter how you handle cases like this, deadlocks will always be
possible... So I don't think that's a huge concern. It's more important
not to have intentional livelocks, which we would have if we always
allowed the waiter with the lowest count to get woken and steal the
budget every time.

> works well
> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
> Tested-by: Pavel Begunkov <asml.silence@gmail.com>

Thanks, will add!

> BTW, I searched for wait_event*(), and it seems there are plenty of
> similar use cases. So a generic version would be useful, but that's
> for later.

Agree, it would undoubtedly be useful.
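
For reference, a sketch of what a generic form might look like
(hypothetical names, sketching how the io_wake_function() approach
could be generalised; this is not the wait_threshold API from v2 of
this series):

	struct threshold_wait {
		struct wait_queue_entry	wq;
		long			threshold;
		atomic_long_t		*counter;
	};

	static int threshold_wake_function(struct wait_queue_entry *curr,
					   unsigned int mode, int wake_flags,
					   void *key)
	{
		struct threshold_wait *tw =
			container_of(curr, struct threshold_wait, wq);

		/*
		 * Same trick as io_wake_function(): a negative return
		 * stops the wait-queue walk, leaving waiters queued
		 * behind an unsatisfied exclusive entry untouched.
		 */
		if (atomic_long_read(tw->counter) < tw->threshold)
			return -1;

		return autoremove_wake_function(curr, mode, wake_flags, key);
	}

Callers would embed a struct threshold_wait and run the same
prepare_to_wait_exclusive()/schedule()/finish_wait() loop as in the
patch above.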