
[v2,1/2] sched/wait: Add add_wait_queue_priority()

Message ID 20201027143944.648769-2-dwmw2@infradead.org (mailing list archive)
State New, archived
Series: Allow KVM IRQFD to consistently intercept events

Commit Message

David Woodhouse Oct. 27, 2020, 2:39 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

This allows an exclusive wait_queue_entry to be added at the head of the
queue, instead of the tail as normal. Thus, it gets to consume events
first without allowing non-exclusive waiters to be woken at all.

The (first) intended use is for KVM IRQFD, which currently has
inconsistent behaviour depending on whether posted interrupts are
available or not. If they are, KVM will bypass the eventfd completely
and deliver interrupts directly to the appropriate vCPU. If not, events
are delivered through the eventfd and userspace will receive them when
polling on the eventfd.

By using add_wait_queue_priority(), KVM will be able to consistently
consume events within the kernel without accidentally exposing them
to userspace when they're supposed to be bypassed. This, in turn, means
that userspace doesn't have to jump through hoops to avoid listening
on the erroneously noisy eventfd and injecting duplicate interrupts.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/wait.h | 12 +++++++++++-
 kernel/sched/wait.c  | 17 ++++++++++++++++-
 2 files changed, 27 insertions(+), 2 deletions(-)
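
[Not part of the patch: a minimal sketch of how an in-kernel consumer might
hook an eventfd with the new helper. The my_*-prefixed names are hypothetical
and KVM's real irqfd code differs in detail. Because add_wait_queue_priority()
sets WQ_FLAG_EXCLUSIVE and places the entry at the head of the queue, a wake
callback that returns nonzero ends the wakeup before any non-exclusive
waiters, such as userspace pollers, are reached.]

#include <linux/kernel.h>
#include <linux/poll.h>
#include <linux/wait.h>

struct my_consumer {
	wait_queue_entry_t wait;	/* sits at the head of the eventfd queue */
	poll_table pt;
};

/* Wake callback, invoked under the waitqueue lock. Returning nonzero from
 * an entry with WQ_FLAG_EXCLUSIVE set stops the wakeup walk, so the
 * non-exclusive waiters queued behind us are never woken. */
static int my_wakeup(wait_queue_entry_t *wait, unsigned int mode,
		     int sync, void *key)
{
	/* A real callback would container_of() back to its context here. */
	if (key_to_poll(key) & EPOLLIN)
		return 1;	/* event consumed in-kernel; don't wake others */
	return 0;
}

/* poll_table callback: vfs_poll() hands us the eventfd's wait queue head. */
static void my_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
				 poll_table *pt)
{
	struct my_consumer *c = container_of(pt, struct my_consumer, pt);

	add_wait_queue_priority(wqh, &c->wait);		/* head, not tail */
}

static void my_consumer_init(struct my_consumer *c)
{
	init_waitqueue_func_entry(&c->wait, my_wakeup);
	init_poll_funcptr(&c->pt, my_ptable_queue_proc);
	/* The caller would then do vfs_poll(eventfd_file, &c->pt) to attach. */
}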

Comments

Peter Zijlstra Oct. 27, 2020, 7:09 p.m. UTC | #1
On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> This allows an exclusive wait_queue_entry to be added at the head of the
> queue, instead of the tail as normal. Thus, it gets to consume events
> first without allowing non-exclusive waiters to be woken at all.
> 
> The (first) intended use is for KVM IRQFD, which currently has

Do you have more? You could easily special case this inside the KVM
code.

I don't _think_ the other users of __add_wait_queue() will mind the
extra branch, but what do I know.

> inconsistent behaviour depending on whether posted interrupts are
> available or not. If they are, KVM will bypass the eventfd completely
> and deliver interrupts directly to the appropriate vCPU. If not, events
> are delivered through the eventfd and userspace will receive them when
> polling on the eventfd.
> 
> By using add_wait_queue_priority(), KVM will be able to consistently
> consume events within the kernel without accidentally exposing them
> to userspace when they're supposed to be bypassed. This, in turn, means
> that userspace doesn't have to jump through hoops to avoid listening
> on the erroneously noisy eventfd and injecting duplicate interrupts.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  include/linux/wait.h | 12 +++++++++++-
>  kernel/sched/wait.c  | 17 ++++++++++++++++-
>  2 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 27fb99cfeb02..fe10e8570a52 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
>  #define WQ_FLAG_BOOKMARK	0x04
>  #define WQ_FLAG_CUSTOM		0x08
>  #define WQ_FLAG_DONE		0x10
> +#define WQ_FLAG_PRIORITY	0x20
>  
>  /*
>   * A single wait-queue entry structure:
> @@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
>  
>  extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
> +extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  
>  static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
> -	list_add(&wq_entry->entry, &wq_head->head);
> +	struct list_head *head = &wq_head->head;
> +	struct wait_queue_entry *wq;
> +
> +	list_for_each_entry(wq, &wq_head->head, entry) {
> +		if (!(wq->flags & WQ_FLAG_PRIORITY))
> +			break;
> +		head = &wq->entry;
> +	}
> +	list_add(&wq_entry->entry, head);
>  }

So you're adding the PRIORITY things to the head of the list and need
the PRIORITY flag to keep them in FIFO order there, right?

While looking at this I found that weird __add_wait_queue_exclusive()
which is used by fs/eventpoll.c and does something similar, except it
doesn't keep the FIFO order.

The Changelog doesn't state how important this property is to you.

>  /*
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 01f5d3020589..183cc6ae68a6 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
>  }
>  EXPORT_SYMBOL(add_wait_queue_exclusive);
>  
> +void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
> +{
> +	unsigned long flags;
> +
> +	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
> +	spin_lock_irqsave(&wq_head->lock, flags);
> +	__add_wait_queue(wq_head, wq_entry);
> +	spin_unlock_irqrestore(&wq_head->lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(add_wait_queue_priority);
> +
>  void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
>  	unsigned long flags;
> @@ -57,7 +68,11 @@ EXPORT_SYMBOL(remove_wait_queue);
>  /*
>   * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
>   * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
> - * number) then we wake all the non-exclusive tasks and one exclusive task.
> + * number) then we wake that number of exclusive tasks, and potentially all
> + * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
> + * the list and any non-exclusive tasks will be woken first. A priority task
> + * may be at the head of the list, and can consume the event without any other
> + * tasks being woken.
>   *
>   * There are circumstances in which we can try to wake a task which has already
>   * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
> -- 
> 2.26.2
>
David Woodhouse Oct. 27, 2020, 7:27 p.m. UTC | #2
On Tue, 2020-10-27 at 20:09 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > This allows an exclusive wait_queue_entry to be added at the head of the
> > queue, instead of the tail as normal. Thus, it gets to consume events
> > first without allowing non-exclusive waiters to be woken at all.
> > 
> > The (first) intended use is for KVM IRQFD, which currently has
> 
> Do you have more? You could easily special case this inside the KVM
> code.

I don't have more right now. What is the easy special case that you
see?

> I don't _think_ the other users of __add_wait_queue() will mind the
> extra branch, but what do I know.

I suppose we could add an unlikely() in there. It seemed like premature
optimisation.

> >  static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
> >  {
> > -	list_add(&wq_entry->entry, &wq_head->head);
> > +	struct list_head *head = &wq_head->head;
> > +	struct wait_queue_entry *wq;
> > +
> > +	list_for_each_entry(wq, &wq_head->head, entry) {
> > +		if (!(wq->flags & WQ_FLAG_PRIORITY))
> > +			break;
> > +		head = &wq->entry;
> > +	}
> > +	list_add(&wq_entry->entry, head);
> >  }
> 
> So you're adding the PRIORITY things to the head of the list and need
> the PRIORITY flag to keep them in FIFO order there, right?

No, I don't care about the order of priority entries; there will
typically be only one of them; that's the point. (I'd have used the
word 'exclusive' if that wasn't already in use for something that...
well... isn't.)

I only care that the priority entries come *before* the bog-standard
non-exclusive entries (like ep_poll_callback).

The priority items end up getting added in FIFO order purely by chance,
because it was simpler to use the same insertion flow for both priority
and normal non-exclusive entries instead of making a new case. So they
all get inserted behind any existing priority entries.

> While looking at this I found that weird __add_wait_queue_exclusive()
> which is used by fs/eventpoll.c and does something similar, except it
> doesn't keep the FIFO order.

It does, doesn't it? Except those so-called "exclusive" entries end up
in FIFO order amongst themselves at the *tail* of the queue, to be
woken up only after all the other entries before them *haven't* been
excluded.

> The Changelog doesn't state how important this property is to you.

Because it isn't :)

The ordering is:

 { PRIORITY }*  { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }*

I care that PRIORITY comes before the others, because I want to
actually exclude the others. Especially the "non-exclusive" ones, which
the 'exclusive' ones don't actually exclude.

I absolutely don't care about ordering *within* the set of PRIORITY
entries, since as I said I expect there to be only one.
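
[Not part of the thread: a simplified paraphrase of the core wake loop in
kernel/sched/wait.c (the real __wake_up_common() also handles bookmark
entries, omitted here), to show concretely why an exclusive entry at the
head of the list can exclude everything behind it.]

static int simplified_wake_up_common(struct wait_queue_head *wq_head,
				     unsigned int mode, int nr_exclusive,
				     int wake_flags, void *key)
{
	struct wait_queue_entry *curr, *next;

	list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
		unsigned int flags = curr->flags;
		int ret = curr->func(curr, mode, wake_flags, key);

		if (ret < 0)
			break;
		/*
		 * A priority entry is first in the list, has WQ_FLAG_EXCLUSIVE
		 * set and returns nonzero once it has consumed the event: with
		 * the usual nr_exclusive == 1 the walk stops right here, before
		 * any non-exclusive entries further down are reached.
		 */
		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}

	return nr_exclusive;
}
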
Peter Zijlstra Oct. 27, 2020, 8:30 p.m. UTC | #3
On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:

> > While looking at this I found that weird __add_wait_queue_exclusive()
> > which is used by fs/eventpoll.c and does something similar, except it
> > doesn't keep the FIFO order.
> 
> It does, doesn't it? Except those so-called "exclusive" entries end up
> in FIFO order amongst themselves at the *tail* of the queue, to be
> woken up only after all the other entries before them *haven't* been
> excluded.

__add_wait_queue_exclusive() uses __add_wait_queue() which does
list_add(). It does _not_ add at the tail like normal exclusive users,
and there is exactly _1_ user in tree that does this.

I'm not exactly sure how this happened, but:

  add_wait_queue_exclusive()

and

  __add_wait_queue_exclusive()

are not related :-(

> > The Changelog doesn't state how important this property is to you.
> 
> Because it isn't :)
> 
> The ordering is:
> 
>  { PRIORITY }*  { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }*
> 
> I care that PRIORITY comes before the others, because I want to
> actually exclude the others. Especially the "non-exclusive" ones, which
> the 'exclusive' ones don't actually exclude.
> 
> I absolutely don't care about ordering *within* the set of PRIORITY
> entries, since as I said I expect there to be only one.

Then you could arguably do something like:

	spin_lock_irqsave(&wq_head->lock, flags);
	__add_wait_queue_exclusive(wq_head, wq_entry);
	spin_unlock_irqrestore(&wq_head->lock, flags);

and leave it at that.

But now I'm itching to fix that horrible naming... tomorrow perhaps.
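
[For reference, not part of the thread: the two similarly-named helpers being
discussed, paraphrased from include/linux/wait.h and kernel/sched/wait.c as
they stood before the cleanup posted below. The locked add_wait_queue_exclusive()
queues at the tail, while the __add_wait_queue_exclusive() inline queues at the
head, which is why Peter calls them unrelated.]

/* kernel/sched/wait.c: "wake one of many" -- exclusive entry at the TAIL */
void add_wait_queue_exclusive(struct wait_queue_head *wq_head,
			      struct wait_queue_entry *wq_entry)
{
	unsigned long flags;

	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&wq_head->lock, flags);
	__add_wait_queue_entry_tail(wq_head, wq_entry);
	spin_unlock_irqrestore(&wq_head->lock, flags);
}

/* include/linux/wait.h: despite the name, this adds at the HEAD, since
 * __add_wait_queue() was historically a plain list_add() */
static inline void
__add_wait_queue_exclusive(struct wait_queue_head *wq_head,
			   struct wait_queue_entry *wq_entry)
{
	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
	__add_wait_queue(wq_head, wq_entry);
}
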
David Woodhouse Oct. 27, 2020, 8:49 p.m. UTC | #4
On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:
> 
> > > While looking at this I found that weird __add_wait_queue_exclusive()
> > > which is used by fs/eventpoll.c and does something similar, except it
> > > doesn't keep the FIFO order.
> > 
> > It does, doesn't it? Except those so-called "exclusive" entries end up
> > in FIFO order amongst themselves at the *tail* of the queue, to be
> > woken up only after all the other entries before them *haven't* been
> > excluded.
> 
> __add_wait_queue_exclusive() uses __add_wait_queue() which does
> list_add(). It does _not_ add at the tail like normal exclusive users,
> and there is exactly _1_ user in tree that does this.
> 
> I'm not exactly sure how this happened, but:
> 
>   add_wait_queue_exclusive()
> 
> and
> 
>   __add_wait_queue_exclusive()
> 
> are not related :-(

Oh, that is *particularly* special.

It sounds like the __add_wait_queue_exclusive() version is a half-baked 
attempt at doing what I'm doing here, except....

> > > The Changelog doesn't state how important this property is to you.
> > 
> > Because it isn't :)
> > 
> > The ordering is:
> > 
> >  { PRIORITY }*  { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }*
> > 
> > I care that PRIORITY comes before the others, because I want to
> > actually exclude the others. Especially the "non-exclusive" ones, which
> > the 'exclusive' ones don't actually exclude.
> > 
> > I absolutely don't care about ordering *within* the set of PRIORITY
> > entries, since as I said I expect there to be only one.
> 
> Then you could arguably do something like:
> 
> 	spin_lock_irqsave(&wq_head->lock, flags);
> 	__add_wait_queue_exclusive(wq_head, wq_entry);
> 	spin_unlock_irqrestore(&wq_head->lock, flags);
> 
> and leave it at that.

.. the problem with that is that other waiters *can* end up on the
queue before it, if they are added later. I don't know if the existing
user (ep_poll) cares, but I do.

> But now I'm itching to fix that horrible naming... tomorrow perhaps.

:)
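
[Worked illustration, not from the thread, of the ordering race described
above: A is the in-kernel consumer, B a later-added non-exclusive waiter
such as ep_poll_callback(), wqh the shared wait queue head, and the
waitqueue lock is assumed to be held around each call.]

	/* Plain head insertion, as the old __add_wait_queue_exclusive() does: */
	__add_wait_queue_exclusive(wqh, &A);	/* queue: [A]                    */
	__add_wait_queue(wqh, &B);		/* queue: [B, A] -- B woken first */

	/* With the new helper, the WQ_FLAG_PRIORITY-aware insertion loop keeps
	 * later non-priority entries behind every priority entry: */
	add_wait_queue_priority(wqh, &A);	/* queue: [A]                    */
	__add_wait_queue(wqh, &B);		/* queue: [A, B] -- A still first */
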
David Woodhouse Oct. 27, 2020, 9:32 p.m. UTC | #5
On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:
> 
> > > While looking at this I found that weird __add_wait_queue_exclusive()
> > > which is used by fs/eventpoll.c and does something similar, except it
> > > doesn't keep the FIFO order.
> > 
> > It does, doesn't it? Except those so-called "exclusive" entries end up
> > in FIFO order amongst themselves at the *tail* of the queue, to be
> > woken up only after all the other entries before them *haven't* been
> > excluded.
> 
> __add_wait_queue_exclusive() uses __add_wait_queue() which does
> list_add(). It does _not_ add at the tail like normal exclusive users,
> and there is exactly _1_ user in tree that does this.
> 
> I'm not exactly sure how this happened, but:
> 
>   add_wait_queue_exclusive()
> 
> and
> 
>   __add_wait_queue_exclusive()
> 
> are not related :-(

I think that goes all the way back to here:

https://lkml.org/lkml/2007/5/4/530

It was rounded up in commit d47de16c72 and subsequently "cleaned up"
into an inline in wait.h, but I don't think there was ever a reason for
it to be added to the head of the list instead of the tail.

So I think we can reasonably make __add_wait_queue_exclusive() do
precisely the same thing as add_wait_queue_exclusive() does (modulo
locking).

And then potentially rename them both to something that isn't quite
such a lie. And give me the one I want that *does* actually exclude
other waiters :)
Peter Zijlstra Oct. 28, 2020, 2:20 p.m. UTC | #6
On Tue, Oct 27, 2020 at 09:32:11PM +0000, David Woodhouse wrote:
> On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote:
> > On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:
> > 
> > > > While looking at this I found that weird __add_wait_queue_exclusive()
> > > > which is used by fs/eventpoll.c and does something similar, except it
> > > > doesn't keep the FIFO order.
> > > 
> > > It does, doesn't it? Except those so-called "exclusive" entries end up
> > > in FIFO order amongst themselves at the *tail* of the queue, to be
> > > woken up only after all the other entries before them *haven't* been
> > > excluded.
> > 
> > __add_wait_queue_exclusive() uses __add_wait_queue() which does
> > list_add(). It does _not_ add at the tail like normal exclusive users,
> > and there is exactly _1_ user in tree that does this.
> > 
> > I'm not exactly sure how this happened, but:
> > 
> >   add_wait_queue_exclusive()
> > 
> > and
> > 
> >   __add_wait_queue_exclusive()
> > 
> > are not related :-(
> 
> I think that goes all the way back to here:
> 
> https://lkml.org/lkml/2007/5/4/530
> 
> It was rounded up in commit d47de16c72 and subsequently "cleaned up"
> into an inline in wait.h, but I don't think there was ever a reason for
> it to be added to the head of the list instead of the tail.

Maybe, I'm not sure I can tell in a hurry. I've opted to undo the above
'cleanups'.

> So I think we can reasonably make __add_wait_queue_exclusive() do
> precisely the same thing as add_wait_queue_exclusive() does (modulo
> locking).

Aye, see below.

> And then potentially rename them both to something that isn't quite
> such a lie. And give me the one I want that *does* actually exclude
> other waiters :)

I don't think we want to do that; people are very much used to the
current semantics.

I also very much want to do:
s/__add_wait_queue_entry_tail/__add_wait_queue_tail/ on top of all this.

Anyway, I'll agree to your patch. How do we route this? Shall I take the
waitqueue thing and stick it in a topic branch for Paolo so he can then
merge that and the kvm bits on top into the KVM tree?

---
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4df61129566d..a2a7e1e339f6 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1895,10 +1895,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 */
 		eavail = ep_events_available(ep);
 		if (!eavail) {
-			if (signal_pending(current))
+			if (signal_pending(current)) {
 				res = -EINTR;
-			else
-				__add_wait_queue_exclusive(&ep->wq, &wait);
+			} else {
+				wait.flags |= WQ_FLAG_EXCLUSIVE;
+				__add_wait_queue(&ep->wq, &wait);
+			}
 		}
 		write_unlock_irq(&ep->lock);
 
diff --git a/fs/orangefs/orangefs-bufmap.c b/fs/orangefs/orangefs-bufmap.c
index 538e839590ef..8cac3589f365 100644
--- a/fs/orangefs/orangefs-bufmap.c
+++ b/fs/orangefs/orangefs-bufmap.c
@@ -86,7 +86,7 @@ static int wait_for_free(struct slot_map *m)
 	do {
 		long n = left, t;
 		if (likely(list_empty(&wait.entry)))
-			__add_wait_queue_entry_tail_exclusive(&m->q, &wait);
+			__add_wait_queue_exclusive(&m->q, &wait);
 		set_current_state(TASK_INTERRUPTIBLE);
 
 		if (m->c > 0)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27fb99cfeb02..4b8c4ece13f7 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -171,23 +171,13 @@ static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait
 	list_add(&wq_entry->entry, &wq_head->head);
 }
 
-/*
- * Used for wake-one threads:
- */
-static inline void
-__add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
-{
-	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
-	__add_wait_queue(wq_head, wq_entry);
-}
-
 static inline void __add_wait_queue_entry_tail(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	list_add_tail(&wq_entry->entry, &wq_head->head);
 }
 
 static inline void
-__add_wait_queue_entry_tail_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+__add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
 	__add_wait_queue_entry_tail(wq_head, wq_entry);
Peter Zijlstra Oct. 28, 2020, 2:35 p.m. UTC | #7
On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> This allows an exclusive wait_queue_entry to be added at the head of the
> queue, instead of the tail as normal. Thus, it gets to consume events
> first without allowing non-exclusive waiters to be woken at all.
> 
> The (first) intended use is for KVM IRQFD, which currently has
> inconsistent behaviour depending on whether posted interrupts are
> available or not. If they are, KVM will bypass the eventfd completely
> and deliver interrupts directly to the appropriate vCPU. If not, events
> are delivered through the eventfd and userspace will receive them when
> polling on the eventfd.
> 
> By using add_wait_queue_priority(), KVM will be able to consistently
> consume events within the kernel without accidentally exposing them
> to userspace when they're supposed to be bypassed. This, in turn, means
> that userspace doesn't have to jump through hoops to avoid listening
> on the erroneously noisy eventfd and injecting duplicate interrupts.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> ---
>  include/linux/wait.h | 12 +++++++++++-
>  kernel/sched/wait.c  | 17 ++++++++++++++++-
>  2 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 27fb99cfeb02..fe10e8570a52 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
>  #define WQ_FLAG_BOOKMARK	0x04
>  #define WQ_FLAG_CUSTOM		0x08
>  #define WQ_FLAG_DONE		0x10
> +#define WQ_FLAG_PRIORITY	0x20
>  
>  /*
>   * A single wait-queue entry structure:
> @@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
>  
>  extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
> +extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  
>  static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
> -	list_add(&wq_entry->entry, &wq_head->head);
> +	struct list_head *head = &wq_head->head;
> +	struct wait_queue_entry *wq;
> +
> +	list_for_each_entry(wq, &wq_head->head, entry) {
> +		if (!(wq->flags & WQ_FLAG_PRIORITY))
> +			break;
> +		head = &wq->entry;
> +	}
> +	list_add(&wq_entry->entry, head);
>  }
>  
>  /*
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 01f5d3020589..183cc6ae68a6 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
>  }
>  EXPORT_SYMBOL(add_wait_queue_exclusive);
>  
> +void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
> +{
> +	unsigned long flags;
> +
> +	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
> +	spin_lock_irqsave(&wq_head->lock, flags);
> +	__add_wait_queue(wq_head, wq_entry);
> +	spin_unlock_irqrestore(&wq_head->lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(add_wait_queue_priority);
> +
>  void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
>  	unsigned long flags;
> @@ -57,7 +68,11 @@ EXPORT_SYMBOL(remove_wait_queue);
>  /*
>   * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
>   * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
> - * number) then we wake all the non-exclusive tasks and one exclusive task.
> + * number) then we wake that number of exclusive tasks, and potentially all
> + * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
> + * the list and any non-exclusive tasks will be woken first. A priority task
> + * may be at the head of the list, and can consume the event without any other
> + * tasks being woken.
>   *
>   * There are circumstances in which we can try to wake a task which has already
>   * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
> -- 
> 2.26.2
>
Paolo Bonzini Oct. 28, 2020, 2:44 p.m. UTC | #8
On 28/10/20 15:20, Peter Zijlstra wrote:
> Shall I take the waitqueue thing and stick it in a topic branch for
> Paolo so he can then merge that and the kvm bits on top into the KVM
> tree?

Topic branches are always the best solution. :)

Paolo
David Woodhouse Nov. 4, 2020, 9:35 a.m. UTC | #9
On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > This allows an exclusive wait_queue_entry to be added at the head of the
> > queue, instead of the tail as normal. Thus, it gets to consume events
> > first without allowing non-exclusive waiters to be woken at all.
> > 
> > The (first) intended use is for KVM IRQFD, which currently has
> > inconsistent behaviour depending on whether posted interrupts are
> > available or not. If they are, KVM will bypass the eventfd completely
> > and deliver interrupts directly to the appropriate vCPU. If not, events
> > are delivered through the eventfd and userspace will receive them when
> > polling on the eventfd.
> > 
> > By using add_wait_queue_priority(), KVM will be able to consistently
> > consume events within the kernel without accidentally exposing them
> > to userspace when they're supposed to be bypassed. This, in turn, means
> > that userspace doesn't have to jump through hoops to avoid listening
> > on the erroneously noisy eventfd and injecting duplicate interrupts.
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks. Paolo, the conclusion was that you were going to take this set
through the KVM tree, wasn't it?
Paolo Bonzini Nov. 4, 2020, 11:25 a.m. UTC | #10
On 04/11/20 10:35, David Woodhouse wrote:
> On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:
>> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> This allows an exclusive wait_queue_entry to be added at the head of the
>>> queue, instead of the tail as normal. Thus, it gets to consume events
>>> first without allowing non-exclusive waiters to be woken at all.
>>>
>>> The (first) intended use is for KVM IRQFD, which currently has
>>> inconsistent behaviour depending on whether posted interrupts are
>>> available or not. If they are, KVM will bypass the eventfd completely
>>> and deliver interrupts directly to the appropriate vCPU. If not, events
>>> are delivered through the eventfd and userspace will receive them when
>>> polling on the eventfd.
>>>
>>> By using add_wait_queue_priority(), KVM will be able to consistently
>>> consume events within the kernel without accidentally exposing them
>>> to userspace when they're supposed to be bypassed. This, in turn, means
>>> that userspace doesn't have to jump through hoops to avoid listening
>>> on the erroneously noisy eventfd and injecting duplicate interrupts.
>>>
>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Thanks. Paolo, the conclusion was that you were going to take this set
> through the KVM tree, wasn't it?
> 

Yes.

Paolo
Paolo Bonzini Nov. 6, 2020, 10:17 a.m. UTC | #11
On 04/11/20 10:35, David Woodhouse wrote:
> On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:
>> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> This allows an exclusive wait_queue_entry to be added at the head of the
>>> queue, instead of the tail as normal. Thus, it gets to consume events
>>> first without allowing non-exclusive waiters to be woken at all.
>>>
>>> The (first) intended use is for KVM IRQFD, which currently has
>>> inconsistent behaviour depending on whether posted interrupts are
>>> available or not. If they are, KVM will bypass the eventfd completely
>>> and deliver interrupts directly to the appropriate vCPU. If not, events
>>> are delivered through the eventfd and userspace will receive them when
>>> polling on the eventfd.
>>>
>>> By using add_wait_queue_priority(), KVM will be able to consistently
>>> consume events within the kernel without accidentally exposing them
>>> to userspace when they're supposed to be bypassed. This, in turn, means
>>> that userspace doesn't have to jump through hoops to avoid listening
>>> on the erroneously noisy eventfd and injecting duplicate interrupts.
>>>
>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Thanks. Paolo, the conclusion was that you were going to take this set
> through the KVM tree, wasn't it?
> 

Queued, except for patch 2/3 in the eventfd series which Alex hasn't 
reviewed/acked yet.

Paolo
Alex Williamson Nov. 6, 2020, 4:32 p.m. UTC | #12
On Fri, 6 Nov 2020 11:17:21 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 04/11/20 10:35, David Woodhouse wrote:
> > On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:  
> >> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:  
> >>> From: David Woodhouse <dwmw@amazon.co.uk>
> >>>
> >>> This allows an exclusive wait_queue_entry to be added at the head of the
> >>> queue, instead of the tail as normal. Thus, it gets to consume events
> >>> first without allowing non-exclusive waiters to be woken at all.
> >>>
> >>> The (first) intended use is for KVM IRQFD, which currently has
> >>> inconsistent behaviour depending on whether posted interrupts are
> >>> available or not. If they are, KVM will bypass the eventfd completely
> >>> and deliver interrupts directly to the appropriate vCPU. If not, events
> >>> are delivered through the eventfd and userspace will receive them when
> >>> polling on the eventfd.
> >>>
> >>> By using add_wait_queue_priority(), KVM will be able to consistently
> >>> consume events within the kernel without accidentally exposing them
> >>> to userspace when they're supposed to be bypassed. This, in turn, means
> >>> that userspace doesn't have to jump through hoops to avoid listening
> >>> on the erroneously noisy eventfd and injecting duplicate interrupts.
> >>>
> >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>  
> >>
> >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>  
> > 
> > Thanks. Paolo, the conclusion was that you were going to take this set
> > through the KVM tree, wasn't it?
> >   
> 
> Queued, except for patch 2/3 in the eventfd series which Alex hasn't 
> reviewed/acked yet.

There was no vfio patch here, nor mention why it got dropped in v2
afaict.  Thanks,

Alex
David Woodhouse Nov. 6, 2020, 5:18 p.m. UTC | #13
On 6 November 2020 16:32:00 GMT, Alex Williamson <alex.williamson@redhat.com> wrote:
>On Fri, 6 Nov 2020 11:17:21 +0100
>Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 04/11/20 10:35, David Woodhouse wrote:
>> > On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:  
>> >> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:  
>> >>> From: David Woodhouse <dwmw@amazon.co.uk>
>> >>>
>> >>> This allows an exclusive wait_queue_entry to be added at the head of the
>> >>> queue, instead of the tail as normal. Thus, it gets to consume events
>> >>> first without allowing non-exclusive waiters to be woken at all.
>> >>>
>> >>> The (first) intended use is for KVM IRQFD, which currently has
>> >>> inconsistent behaviour depending on whether posted interrupts are
>> >>> available or not. If they are, KVM will bypass the eventfd completely
>> >>> and deliver interrupts directly to the appropriate vCPU. If not, events
>> >>> are delivered through the eventfd and userspace will receive them when
>> >>> polling on the eventfd.
>> >>>
>> >>> By using add_wait_queue_priority(), KVM will be able to consistently
>> >>> consume events within the kernel without accidentally exposing them
>> >>> to userspace when they're supposed to be bypassed. This, in turn, means
>> >>> that userspace doesn't have to jump through hoops to avoid listening
>> >>> on the erroneously noisy eventfd and injecting duplicate interrupts.
>> >>>
>> >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>> >>
>> >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> >
>> > Thanks. Paolo, the conclusion was that you were going to take this set
>> > through the KVM tree, wasn't it?
>> >
>> 
>> Queued, except for patch 2/3 in the eventfd series which Alex hasn't 
>> reviewed/acked yet.
>
>There was no vfio patch here, nor mention why it got dropped in v2
>afaict.  Thanks,

That was a different (but related) series. The VFIO one is https://patchwork.kernel.org/project/kvm/patch/20201027135523.646811-3-dwmw2@infradead.org/

Thanks.

Patch

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27fb99cfeb02..fe10e8570a52 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -22,6 +22,7 @@  int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
 #define WQ_FLAG_BOOKMARK	0x04
 #define WQ_FLAG_CUSTOM		0x08
 #define WQ_FLAG_DONE		0x10
+#define WQ_FLAG_PRIORITY	0x20
 
 /*
  * A single wait-queue entry structure:
@@ -164,11 +165,20 @@  static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
 
 extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 
 static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
-	list_add(&wq_entry->entry, &wq_head->head);
+	struct list_head *head = &wq_head->head;
+	struct wait_queue_entry *wq;
+
+	list_for_each_entry(wq, &wq_head->head, entry) {
+		if (!(wq->flags & WQ_FLAG_PRIORITY))
+			break;
+		head = &wq->entry;
+	}
+	list_add(&wq_entry->entry, head);
 }
 
 /*
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 01f5d3020589..183cc6ae68a6 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -37,6 +37,17 @@  void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
 }
 EXPORT_SYMBOL(add_wait_queue_exclusive);
 
+void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+{
+	unsigned long flags;
+
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
+	spin_lock_irqsave(&wq_head->lock, flags);
+	__add_wait_queue(wq_head, wq_entry);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_wait_queue_priority);
+
 void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
@@ -57,7 +68,11 @@  EXPORT_SYMBOL(remove_wait_queue);
 /*
  * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
  * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
- * number) then we wake all the non-exclusive tasks and one exclusive task.
+ * number) then we wake that number of exclusive tasks, and potentially all
+ * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
+ * the list and any non-exclusive tasks will be woken first. A priority task
+ * may be at the head of the list, and can consume the event without any other
+ * tasks being woken.
  *
  * There are circumstances in which we can try to wake a task which has already
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns