diff mbox series

[v2,1/2] sched/wait: Add wait_threshold

Message ID d99ce2f7c98d4408aea50f515bbb6b89bc7850e8.1569139018.git.asml.silence@gmail.com (mailing list archive)
State New, archived
Headers show
Series Optimise io_uring completion waiting | expand

Commit Message

Pavel Begunkov Sept. 22, 2019, 8:08 a.m. UTC
From: Pavel Begunkov <asml.silence@gmail.com>

Add wait_threshold -- a custom wait_event derivative, that waits until
a value is equal to or greater than the specified threshold.

v2: rebase
1. use full condition instead of event number generator
2. add WQ_THRESHOLD_WAKE_ALWAYS constant

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/wait_threshold.h | 67 ++++++++++++++++++++++++++++++++++
 kernel/sched/Makefile          |  2 +-
 kernel/sched/wait_threshold.c  | 26 +++++++++++++
 3 files changed, 94 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/wait_threshold.h
 create mode 100644 kernel/sched/wait_threshold.c

Comments

Peter Zijlstra Sept. 23, 2019, 7:19 a.m. UTC | #1
On Sun, Sep 22, 2019 at 11:08:50AM +0300, Pavel Begunkov (Silence) wrote:
> From: Pavel Begunkov <asml.silence@gmail.com>
> 
> Add wait_threshold -- a custom wait_event derivative, that waits until
> a value is equal to or greater than the specified threshold.

This is quite insufficient justification for this monster... what exact
semantics do you want?

Why can't you do this exact same with a slightly more complicated @cond
?
Pavel Begunkov Sept. 23, 2019, 4:37 p.m. UTC | #2
Just in case duplicating a mail from the cover-letter thread:


It could be done with @cond indeed, that's how it works for now.
However, this addresses performance issues only.

The problem with wait_event_*() is that, if we have a counter and are
trying to wake up tasks after each increment, it would schedule each
waiting task O(threshold) times just for it to spuriously check @cond
and go back to sleep. All that overhead (memory barriers, registers
save/load, accounting, etc) turned out to be enough for some workloads
to slow down the system.

With this specialisation it still traverses a wait list and makes
indirect calls to the checker callback, but the list supposedly is
fairly  small, so performance there shouldn't be a problem, at least for
now.

Regarding semantics; It should wake a task when a value passed to
wake_up_threshold() is greater or equal then a task's threshold, that is
specified individually for each task in wait_threshold_*().

In pseudo code:
```
def wake_up_threshold(n, wait_queue):
	for waiter in wait_queue:
		waiter.wake_up_if(n >= waiter.threshold);
```

Any thoughts how to do it better? Ideas are very welcome.

BTW, this monster is mostly a copy-paste from wait_event_*(),
wait_bit_*(). We could try to extract some common parts from these
three, but that's another topic.


On 23/09/2019 10:19, Peter Zijlstra wrote:
> On Sun, Sep 22, 2019 at 11:08:50AM +0300, Pavel Begunkov (Silence) wrote:
>> From: Pavel Begunkov <asml.silence@gmail.com>
>>
>> Add wait_threshold -- a custom wait_event derivative, that waits until
>> a value is equal to or greater than the specified threshold.
> 
> This is quite insufficient justification for this monster... what exact
> semantics do you want?
> 
> Why can't you do this exact same with a slightly more complicated @cond
> ?
>
Peter Zijlstra Sept. 23, 2019, 7:27 p.m. UTC | #3
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

On Mon, Sep 23, 2019 at 07:37:46PM +0300, Pavel Begunkov wrote:
> Just in case duplicating a mail from the cover-letter thread:

Just because every patch should have a self contained and coherent
Changelog.

> It could be done with @cond indeed, that's how it works for now.
> However, this addresses performance issues only.
> 
> The problem with wait_event_*() is that, if we have a counter and are
> trying to wake up tasks after each increment, it would schedule each
> waiting task O(threshold) times just for it to spuriously check @cond
> and go back to sleep. All that overhead (memory barriers, registers
> save/load, accounting, etc) turned out to be enough for some workloads
> to slow down the system.
> 
> With this specialisation it still traverses a wait list and makes
> indirect calls to the checker callback, but the list supposedly is
> fairly  small, so performance there shouldn't be a problem, at least for
> now.
> 
> Regarding semantics; It should wake a task when a value passed to
> wake_up_threshold() is greater or equal then a task's threshold, that is
> specified individually for each task in wait_threshold_*().
> 
> In pseudo code:
> ```
> def wake_up_threshold(n, wait_queue):
> 	for waiter in wait_queue:
> 		waiter.wake_up_if(n >= waiter.threshold);
> ```
> 
> Any thoughts how to do it better? Ideas are very welcome.
> 
> BTW, this monster is mostly a copy-paste from wait_event_*(),
> wait_bit_*(). We could try to extract some common parts from these
> three, but that's another topic.

I don't think that is another topic at all. It is a quality of
implementation issue. We already have too many copies of all that (3).

So basically you want to fudge the wake function to do the/a @cond test,
not unlike what wait_bit already does, but differenly.

I'm really rather annoyed with C for not having proper lambda functions;
that would make all this so much easier. Anyway, let me have a poke at
this in the morning, it's late already.

Also, is anything actually using wait_queue_entry::private ? I'm
not finding any in a hurry.
Peter Zijlstra Sept. 23, 2019, 8:23 p.m. UTC | #4
On Mon, Sep 23, 2019 at 09:27:42PM +0200, Peter Zijlstra wrote:

> Also, is anything actually using wait_queue_entry::private ? I'm
> not finding any in a hurry.

That is; outside of default_wake_function(). I don't see a sensible
reason for that field to be 'void *' nor for the name 'private'.

But _that_ is something we can fix later.
Pavel Begunkov Sept. 24, 2019, 6:44 a.m. UTC | #5
On 23/09/2019 22:27, Peter Zijlstra wrote:
> 
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?
> 
> On Mon, Sep 23, 2019 at 07:37:46PM +0300, Pavel Begunkov wrote:
>> Just in case duplicating a mail from the cover-letter thread:
> 
> Just because every patch should have a self contained and coherent
> Changelog.

Well, I will expand the patch description, if we agree on the
implementation (at least conceptually).


>>
>> BTW, this monster is mostly a copy-paste from wait_event_*(),
>> wait_bit_*(). We could try to extract some common parts from these
>> three, but that's another topic.
> 
> I don't think that is another topic at all. It is a quality of
> implementation issue. We already have too many copies of all that (3).

For example, ___wait_event() is copied in ___wait_var_event(). Instead
it could accept a wait entry generator or just accept entry from above
and be reused in both cases. I've had such a patch, but want to think
what else could be done.

e.g.
```
#define generic_wait_event(ENTRY_GEN, ...)
	ENTRY_GEN(wq_entry_name);
	do_wait_event(wq_entry_name);

#define WBQ_ENTRY_GEN(name)
	struct wait_bit_queue_entry tmp = WBQ_INITIALIZER;
	struct wait_queue_entry	name = &tmp->wq_entry;
```


> 
> So basically you want to fudge the wake function to do the/a @cond test,
> not unlike what wait_bit already does, but differenly.
> 
Yes

> I'm really rather annoyed with C for not having proper lambda functions;
> that would make all this so much easier. Anyway, let me have a poke at
> this in the morning, it's late already.
> 
> Also, is anything actually using wait_queue_entry::private ? I'm
> not finding any in a hurry.
> 
>
diff mbox series

Patch

diff --git a/include/linux/wait_threshold.h b/include/linux/wait_threshold.h
new file mode 100644
index 000000000000..d8b054504c26
--- /dev/null
+++ b/include/linux/wait_threshold.h
@@ -0,0 +1,67 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_WAIT_THRESHOLD_H
+#define _LINUX_WAIT_THRESHOLD_H
+
+#include <linux/wait.h>
+
+#define WQ_THRESHOLD_WAKE_ALWAYS	(~0ui)
+
+struct wait_threshold_queue_entry {
+	struct wait_queue_entry wq_entry;
+	unsigned int threshold;
+};
+
+
+void init_wait_threshold_entry(struct wait_threshold_queue_entry *wtq_entry,
+				unsigned int threshold);
+
+static inline void wake_up_threshold(struct wait_queue_head *wq_head,
+					unsigned int val)
+{
+	void *arg = (void *)(unsigned long)val;
+
+	__wake_up(wq_head, TASK_NORMAL, 1, arg);
+}
+
+#define ___wait_threshold_event(q, thresh, condition, state,		\
+				exclusive, ret, cmd)			\
+({									\
+	__label__ __out;						\
+	struct wait_queue_head *__wq_head = &q;				\
+	struct wait_threshold_queue_entry __wtq_entry;			\
+	struct wait_queue_entry *__wq_entry = &__wtq_entry.wq_entry;	\
+	long __ret = ret; /* explicit shadow */				\
+									\
+	init_wait_threshold_entry(&__wtq_entry, thresh);		\
+	for (;;) {							\
+		long __int = prepare_to_wait_event(__wq_head,		\
+						   __wq_entry,		\
+						   state);		\
+		if (condition)						\
+			break;						\
+									\
+		if (___wait_is_interruptible(state) && __int) {		\
+			__ret = __int;					\
+			goto __out;					\
+		}							\
+									\
+		cmd;							\
+	}								\
+	finish_wait(__wq_head, __wq_entry);				\
+__out:	__ret;								\
+})
+
+#define __wait_threshold_interruptible(q, thresh, condition)		\
+	___wait_threshold_event(q, thresh, condition, TASK_INTERRUPTIBLE, 0, 0,\
+			  schedule())
+
+#define wait_threshold_interruptible(q, threshold, condition)	\
+({								\
+	int __ret = 0;						\
+	might_sleep();						\
+	if (!(condition))					\
+		__ret = __wait_threshold_interruptible(q,	\
+			threshold, condition);			\
+	__ret;							\
+})
+#endif /* _LINUX_WAIT_THRESHOLD_H */
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 21fb5a5662b5..bb895a3184f9 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -18,7 +18,7 @@  endif
 
 obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle.o fair.o rt.o deadline.o
-obj-y += wait.o wait_bit.o swait.o completion.o
+obj-y += wait.o wait_bit.o wait_threshold.o swait.o completion.o
 
 obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
diff --git a/kernel/sched/wait_threshold.c b/kernel/sched/wait_threshold.c
new file mode 100644
index 000000000000..80a027c02ff3
--- /dev/null
+++ b/kernel/sched/wait_threshold.c
@@ -0,0 +1,26 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+#include "sched.h"
+#include <linux/wait_threshold.h>
+
+static int wake_threshold_function(struct wait_queue_entry *wq_entry,
+				   unsigned int mode, int sync, void *arg)
+{
+	unsigned int val = (unsigned int)(unsigned long)arg;
+	struct wait_threshold_queue_entry *wtq_entry =
+		container_of(wq_entry, struct wait_threshold_queue_entry,
+			wq_entry);
+
+	if (val < wtq_entry->threshold)
+		return 0;
+
+	return default_wake_function(wq_entry, mode, sync, arg);
+}
+
+void init_wait_threshold_entry(struct wait_threshold_queue_entry *wtq_entry,
+			       unsigned int threshold)
+{
+	init_wait_entry(&wtq_entry->wq_entry, 0);
+	wtq_entry->wq_entry.func = wake_threshold_function;
+	wtq_entry->threshold = threshold;
+}
+EXPORT_SYMBOL(init_wait_threshold_entry);