Message ID: 20201027143944.648769-2-dwmw2@infradead.org (mailing list archive)
State: New, archived
Series: Allow KVM IRQFD to consistently intercept events
On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> This allows an exclusive wait_queue_entry to be added at the head of the
> queue, instead of the tail as normal. Thus, it gets to consume events
> first without allowing non-exclusive waiters to be woken at all.
>
> The (first) intended use is for KVM IRQFD, which currently has

Do you have more? You could easily special case this inside the KVM
code. I don't _think_ the other users of __add_wait_queue() will mind
the extra branch, but what do I know.

> inconsistent behaviour depending on whether posted interrupts are
> available or not. If they are, KVM will bypass the eventfd completely
> and deliver interrupts directly to the appropriate vCPU. If not, events
> are delivered through the eventfd and userspace will receive them when
> polling on the eventfd.
>
> By using add_wait_queue_priority(), KVM will be able to consistently
> consume events within the kernel without accidentally exposing them
> to userspace when they're supposed to be bypassed. This, in turn, means
> that userspace doesn't have to jump through hoops to avoid listening
> on the erroneously noisy eventfd and injecting duplicate interrupts.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  include/linux/wait.h | 12 +++++++++++-
>  kernel/sched/wait.c  | 17 ++++++++++++++++-
>  2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 27fb99cfeb02..fe10e8570a52 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
>  #define WQ_FLAG_BOOKMARK	0x04
>  #define WQ_FLAG_CUSTOM		0x08
>  #define WQ_FLAG_DONE		0x10
> +#define WQ_FLAG_PRIORITY	0x20
>
>  /*
>   * A single wait-queue entry structure:
> @@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
>
>  extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
> +extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>
>  static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
> -	list_add(&wq_entry->entry, &wq_head->head);
> +	struct list_head *head = &wq_head->head;
> +	struct wait_queue_entry *wq;
> +
> +	list_for_each_entry(wq, &wq_head->head, entry) {
> +		if (!(wq->flags & WQ_FLAG_PRIORITY))
> +			break;
> +		head = &wq->entry;
> +	}
> +	list_add(&wq_entry->entry, head);
>  }

So you're adding the PRIORITY things to the head of the list and need
the PRIORITY flag to keep them in FIFO order there, right?

While looking at this I found that weird __add_wait_queue_exclusive()
which is used by fs/eventpoll.c and does something similar, except it
doesn't keep the FIFO order.

The Changelog doesn't state how important this property is to you.

>
>  /*
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 01f5d3020589..183cc6ae68a6 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
>  }
>  EXPORT_SYMBOL(add_wait_queue_exclusive);
>
> +void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
> +{
> +	unsigned long flags;
> +
> +	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
> +	spin_lock_irqsave(&wq_head->lock, flags);
> +	__add_wait_queue(wq_head, wq_entry);
> +	spin_unlock_irqrestore(&wq_head->lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(add_wait_queue_priority);
> +
>  void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
>  	unsigned long flags;
> @@ -57,7 +68,11 @@ EXPORT_SYMBOL(remove_wait_queue);
>  /*
>   * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
>   * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
> - * number) then we wake all the non-exclusive tasks and one exclusive task.
> + * number) then we wake that number of exclusive tasks, and potentially all
> + * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
> + * the list and any non-exclusive tasks will be woken first. A priority task
> + * may be at the head of the list, and can consume the event without any other
> + * tasks being woken.
>   *
>   * There are circumstances in which we can try to wake a task which has already
>   * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
> --
> 2.26.2
>
On Tue, 2020-10-27 at 20:09 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > This allows an exclusive wait_queue_entry to be added at the head of the
> > queue, instead of the tail as normal. Thus, it gets to consume events
> > first without allowing non-exclusive waiters to be woken at all.
> >
> > The (first) intended use is for KVM IRQFD, which currently has
>
> Do you have more? You could easily special case this inside the KVM
> code.

I don't have more right now. What is the easy special case that you
see?

> I don't _think_ the other users of __add_wait_queue() will mind the
> extra branch, but what do I know.

I suppose we could add an unlikely() in there. It seemed like premature
optimisation.

> > static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
> > {
> > -	list_add(&wq_entry->entry, &wq_head->head);
> > +	struct list_head *head = &wq_head->head;
> > +	struct wait_queue_entry *wq;
> > +
> > +	list_for_each_entry(wq, &wq_head->head, entry) {
> > +		if (!(wq->flags & WQ_FLAG_PRIORITY))
> > +			break;
> > +		head = &wq->entry;
> > +	}
> > +	list_add(&wq_entry->entry, head);
> > }
>
> So you're adding the PRIORITY things to the head of the list and need
> the PRIORITY flag to keep them in FIFO order there, right?

No, I don't care about the order of priority entries; there will
typically be only one of them; that's the point. (I'd have used the
word 'exclusive' if that wasn't already in use for something that...
well... isn't.)

I only care that the priority entries come *before* the bog-standard
non-exclusive entries (like ep_poll_callback).

The priority items end up getting added in FIFO order purely by chance,
because it was simpler to use the same insertion flow for both priority
and normal non-exclusive entries instead of making a new case. So they
all get inserted behind any existing priority entries.

> While looking at this I found that weird __add_wait_queue_exclusive()
> which is used by fs/eventpoll.c and does something similar, except it
> doesn't keep the FIFO order.

It does, doesn't it? Except those so-called "exclusive" entries end up
in FIFO order amongst themselves at the *tail* of the queue, to be
woken up only after all the other entries before them *haven't* been
excluded.

> The Changelog doesn't state how important this property is to you.

Because it isn't :)

The ordering is:

  { PRIORITY }* { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }*

I care that PRIORITY comes before the others, because I want to
actually exclude the others. Especially the "non-exclusive" ones, which
the 'exclusive' ones don't actually exclude.

I absolutely don't care about ordering *within* the set of PRIORITY
entries, since as I said I expect there to be only one.
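[As a reading aid for the exclusion argument above, here is a simplified, paraphrased sketch of the wakeup walk, loosely based on __wake_up_common(); bookmark handling and other details are omitted, so this is not the verbatim kernel code. With a PRIORITY entry first in the list and an eventfd-style wakeup using nr_exclusive == 1, the walk never reaches the non-exclusive entries if that entry's callback returns nonzero:]

	list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
		unsigned flags = curr->flags;
		int ret = curr->func(curr, mode, wake_flags, key);

		if (ret < 0)
			break;
		/*
		 * A PRIORITY entry is also EXCLUSIVE and sits at the front
		 * of the list, so a nonzero return here consumes the single
		 * exclusive slot and the walk stops before any non-exclusive
		 * entries (e.g. ep_poll_callback) are reached.
		 */
		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}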
On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:
> > While looking at this I found that weird __add_wait_queue_exclusive()
> > which is used by fs/eventpoll.c and does something similar, except it
> > doesn't keep the FIFO order.
>
> It does, doesn't it? Except those so-called "exclusive" entries end up
> in FIFO order amongst themselves at the *tail* of the queue, to be
> woken up only after all the other entries before them *haven't* been
> excluded.

__add_wait_queue_exclusive() uses __add_wait_queue() which does
list_add(). It does _not_ add at the tail like normal exclusive users,
and there is exactly _1_ user in tree that does this.

I'm not exactly sure how this happened, but:

  add_wait_queue_exclusive()

and

  __add_wait_queue_exclusive()

are not related :-(

> > The Changelog doesn't state how important this property is to you.
>
> Because it isn't :)
>
> The ordering is:
>
>   { PRIORITY }* { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }*
>
> I care that PRIORITY comes before the others, because I want to
> actually exclude the others. Especially the "non-exclusive" ones, which
> the 'exclusive' ones don't actually exclude.
>
> I absolutely don't care about ordering *within* the set of PRIORITY
> entries, since as I said I expect there to be only one.

Then you could arguably do something like:

	spin_lock_irqsave(&wq_head->lock, flags);
	__add_wait_queue_exclusive(wq_head, wq_entry);
	spin_unlock_irqrestore(&wq_head->lock, flags);

and leave it at that.

But now I'm itching to fix that horrible naming... tomorrow perhaps.
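[For context, the two similarly named helpers being contrasted here looked roughly like this at the time. This is a sketch reconstructed from the diffs quoted later in the thread, so treat comments and whitespace as approximate:]

	/* kernel/sched/wait.c -- the "real" exclusive API: adds at the TAIL */
	void add_wait_queue_exclusive(struct wait_queue_head *wq_head,
				      struct wait_queue_entry *wq_entry)
	{
		unsigned long flags;

		wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
		spin_lock_irqsave(&wq_head->lock, flags);
		__add_wait_queue_entry_tail(wq_head, wq_entry);
		spin_unlock_irqrestore(&wq_head->lock, flags);
	}

	/* include/linux/wait.h -- the lookalike used by eventpoll: adds at the HEAD */
	static inline void
	__add_wait_queue_exclusive(struct wait_queue_head *wq_head,
				   struct wait_queue_entry *wq_entry)
	{
		wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
		__add_wait_queue(wq_head, wq_entry);	/* list_add(), i.e. head */
	}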
On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:
> >
> > > While looking at this I found that weird __add_wait_queue_exclusive()
> > > which is used by fs/eventpoll.c and does something similar, except it
> > > doesn't keep the FIFO order.
> >
> > It does, doesn't it? Except those so-called "exclusive" entries end up
> > in FIFO order amongst themselves at the *tail* of the queue, to be
> > woken up only after all the other entries before them *haven't* been
> > excluded.
>
> __add_wait_queue_exclusive() uses __add_wait_queue() which does
> list_add(). It does _not_ add at the tail like normal exclusive users,
> and there is exactly _1_ user in tree that does this.
>
> I'm not exactly sure how this happened, but:
>
>   add_wait_queue_exclusive()
>
> and
>
>   __add_wait_queue_exclusive()
>
> are not related :-(

Oh, that is *particularly* special. It sounds like the
__add_wait_queue_exclusive() version is a half-baked attempt at doing
what I'm doing here, except...

> > > The Changelog doesn't state how important this property is to you.
> >
> > Because it isn't :)
> >
> > The ordering is:
> >
> >   { PRIORITY }* { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }*
> >
> > I care that PRIORITY comes before the others, because I want to
> > actually exclude the others. Especially the "non-exclusive" ones, which
> > the 'exclusive' ones don't actually exclude.
> >
> > I absolutely don't care about ordering *within* the set of PRIORITY
> > entries, since as I said I expect there to be only one.
>
> Then you could arguably do something like:
>
>	spin_lock_irqsave(&wq_head->lock, flags);
>	__add_wait_queue_exclusive(wq_head, wq_entry);
>	spin_unlock_irqrestore(&wq_head->lock, flags);
>
> and leave it at that.

... the problem with that is that other waiters *can* end up on the
queue before it, if they are added later. I don't know if the existing
user (ep_poll) cares, but I do.

> But now I'm itching to fix that horrible naming... tomorrow perhaps.

:)
On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:
> >
> > > While looking at this I found that weird __add_wait_queue_exclusive()
> > > which is used by fs/eventpoll.c and does something similar, except it
> > > doesn't keep the FIFO order.
> >
> > It does, doesn't it? Except those so-called "exclusive" entries end up
> > in FIFO order amongst themselves at the *tail* of the queue, to be
> > woken up only after all the other entries before them *haven't* been
> > excluded.
>
> __add_wait_queue_exclusive() uses __add_wait_queue() which does
> list_add(). It does _not_ add at the tail like normal exclusive users,
> and there is exactly _1_ user in tree that does this.
>
> I'm not exactly sure how this happened, but:
>
>   add_wait_queue_exclusive()
>
> and
>
>   __add_wait_queue_exclusive()
>
> are not related :-(

I think that goes all the way back to here:

https://lkml.org/lkml/2007/5/4/530

It was rounded up in commit d47de16c72 and subsequently "cleaned up"
into an inline in wait.h, but I don't think there was ever a reason for
it to be added to the head of the list instead of the tail.

So I think we can reasonably make __add_wait_queue_exclusive() do
precisely the same thing as add_wait_queue_exclusive() does (modulo
locking). And then potentially rename them both to something that isn't
quite such a lie. And give me the one I want that *does* actually
exclude other waiters :)
On Tue, Oct 27, 2020 at 09:32:11PM +0000, David Woodhouse wrote:
> On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote:
> > On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:
> > >
> > > > While looking at this I found that weird __add_wait_queue_exclusive()
> > > > which is used by fs/eventpoll.c and does something similar, except it
> > > > doesn't keep the FIFO order.
> > >
> > > It does, doesn't it? Except those so-called "exclusive" entries end up
> > > in FIFO order amongst themselves at the *tail* of the queue, to be
> > > woken up only after all the other entries before them *haven't* been
> > > excluded.
> >
> > __add_wait_queue_exclusive() uses __add_wait_queue() which does
> > list_add(). It does _not_ add at the tail like normal exclusive users,
> > and there is exactly _1_ user in tree that does this.
> >
> > I'm not exactly sure how this happened, but:
> >
> >   add_wait_queue_exclusive()
> >
> > and
> >
> >   __add_wait_queue_exclusive()
> >
> > are not related :-(
>
> I think that goes all the way back to here:
>
> https://lkml.org/lkml/2007/5/4/530
>
> It was rounded up in commit d47de16c72 and subsequently "cleaned up"
> into an inline in wait.h, but I don't think there was ever a reason for
> it to be added to the head of the list instead of the tail.

Maybe, I'm not sure I can tell in a hurry. I've opted to undo the above
'cleanups'.

> So I think we can reasonably make __add_wait_queue_exclusive() do
> precisely the same thing as add_wait_queue_exclusive() does (modulo
> locking).

Aye, see below.

> And then potentially rename them both to something that isn't quite
> such a lie. And give me the one I want that *does* actually exclude
> other waiters :)

I don't think we want to do that; people are very much used to the
current semantics.

I also very much want to do:

  s/__add_wait_queue_entry_tail/__add_wait_queue_tail/

on top of all this.

Anyway, I'll agree to your patch. How do we route this? Shall I take
the waitqueue thing and stick it in a topic branch for Paolo so he can
then merge that and the kvm bits on top into the KVM tree?
---
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4df61129566d..a2a7e1e339f6 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1895,10 +1895,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 */
 		eavail = ep_events_available(ep);
 		if (!eavail) {
-			if (signal_pending(current))
+			if (signal_pending(current)) {
 				res = -EINTR;
-			else
-				__add_wait_queue_exclusive(&ep->wq, &wait);
+			} else {
+				wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
+				__add_wait_queue(wq_head, wq_entry);
+			}
 		}
 		write_unlock_irq(&ep->lock);

diff --git a/fs/orangefs/orangefs-bufmap.c b/fs/orangefs/orangefs-bufmap.c
index 538e839590ef..8cac3589f365 100644
--- a/fs/orangefs/orangefs-bufmap.c
+++ b/fs/orangefs/orangefs-bufmap.c
@@ -86,7 +86,7 @@ static int wait_for_free(struct slot_map *m)
 	do {
 		long n = left, t;
 		if (likely(list_empty(&wait.entry)))
-			__add_wait_queue_entry_tail_exclusive(&m->q, &wait);
+			__add_wait_queue_exclusive(&m->q, &wait);
 		set_current_state(TASK_INTERRUPTIBLE);

 		if (m->c > 0)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27fb99cfeb02..4b8c4ece13f7 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -171,23 +171,13 @@ static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait
 	list_add(&wq_entry->entry, &wq_head->head);
 }

-/*
- * Used for wake-one threads:
- */
-static inline void
-__add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
-{
-	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
-	__add_wait_queue(wq_head, wq_entry);
-}
-
 static inline void __add_wait_queue_entry_tail(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	list_add_tail(&wq_entry->entry, &wq_head->head);
 }

 static inline void
-__add_wait_queue_entry_tail_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+__add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
 	__add_wait_queue_entry_tail(wq_head, wq_entry);
On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > This allows an exclusive wait_queue_entry to be added at the head of the > queue, instead of the tail as normal. Thus, it gets to consume events > first without allowing non-exclusive waiters to be woken at all. > > The (first) intended use is for KVM IRQFD, which currently has > inconsistent behaviour depending on whether posted interrupts are > available or not. If they are, KVM will bypass the eventfd completely > and deliver interrupts directly to the appropriate vCPU. If not, events > are delivered through the eventfd and userspace will receive them when > polling on the eventfd. > > By using add_wait_queue_priority(), KVM will be able to consistently > consume events within the kernel without accidentally exposing them > to userspace when they're supposed to be bypassed. This, in turn, means > that userspace doesn't have to jump through hoops to avoid listening > on the erroneously noisy eventfd and injecting duplicate interrupts. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > include/linux/wait.h | 12 +++++++++++- > kernel/sched/wait.c | 17 ++++++++++++++++- > 2 files changed, 27 insertions(+), 2 deletions(-) > > diff --git a/include/linux/wait.h b/include/linux/wait.h > index 27fb99cfeb02..fe10e8570a52 100644 > --- a/include/linux/wait.h > +++ b/include/linux/wait.h > @@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int > #define WQ_FLAG_BOOKMARK 0x04 > #define WQ_FLAG_CUSTOM 0x08 > #define WQ_FLAG_DONE 0x10 > +#define WQ_FLAG_PRIORITY 0x20 > > /* > * A single wait-queue entry structure: > @@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head) > > extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry); > extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry); > +extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry); > extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry); > > static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry) > { > - list_add(&wq_entry->entry, &wq_head->head); > + struct list_head *head = &wq_head->head; > + struct wait_queue_entry *wq; > + > + list_for_each_entry(wq, &wq_head->head, entry) { > + if (!(wq->flags & WQ_FLAG_PRIORITY)) > + break; > + head = &wq->entry; > + } > + list_add(&wq_entry->entry, head); > } > > /* > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c > index 01f5d3020589..183cc6ae68a6 100644 > --- a/kernel/sched/wait.c > +++ b/kernel/sched/wait.c > @@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue > } > EXPORT_SYMBOL(add_wait_queue_exclusive); > > +void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry) > +{ > + unsigned long flags; > + > + wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY; > + spin_lock_irqsave(&wq_head->lock, flags); > + __add_wait_queue(wq_head, wq_entry); > + spin_unlock_irqrestore(&wq_head->lock, flags); > +} > +EXPORT_SYMBOL_GPL(add_wait_queue_priority); > + > void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry) > { > unsigned long flags; > @@ -57,7 +68,11 @@ EXPORT_SYMBOL(remove_wait_queue); 
> /* > * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just > * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve > - * number) then we wake all the non-exclusive tasks and one exclusive task. > + * number) then we wake that number of exclusive tasks, and potentially all > + * the non-exclusive tasks. Normally, exclusive tasks will be at the end of > + * the list and any non-exclusive tasks will be woken first. A priority task > + * may be at the head of the list, and can consume the event without any other > + * tasks being woken. > * > * There are circumstances in which we can try to wake a task which has already > * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns > -- > 2.26.2 >
On 28/10/20 15:20, Peter Zijlstra wrote:
> Shall I take the waitqueue thing and stick it in a topic branch for
> Paolo so he can then merge that and the kvm bits on top into the KVM
> tree?

Topic branches are always the best solution. :)

Paolo
On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote: > On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote: > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > This allows an exclusive wait_queue_entry to be added at the head of the > > queue, instead of the tail as normal. Thus, it gets to consume events > > first without allowing non-exclusive waiters to be woken at all. > > > > The (first) intended use is for KVM IRQFD, which currently has > > inconsistent behaviour depending on whether posted interrupts are > > available or not. If they are, KVM will bypass the eventfd completely > > and deliver interrupts directly to the appropriate vCPU. If not, events > > are delivered through the eventfd and userspace will receive them when > > polling on the eventfd. > > > > By using add_wait_queue_priority(), KVM will be able to consistently > > consume events within the kernel without accidentally exposing them > > to userspace when they're supposed to be bypassed. This, in turn, means > > that userspace doesn't have to jump through hoops to avoid listening > > on the erroneously noisy eventfd and injecting duplicate interrupts. > > > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Thanks. Paolo, the conclusion was that you were going to take this set through the KVM tree, wasn't it?
On 04/11/20 10:35, David Woodhouse wrote: > On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote: >> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote: >>> From: David Woodhouse <dwmw@amazon.co.uk> >>> >>> This allows an exclusive wait_queue_entry to be added at the head of the >>> queue, instead of the tail as normal. Thus, it gets to consume events >>> first without allowing non-exclusive waiters to be woken at all. >>> >>> The (first) intended use is for KVM IRQFD, which currently has >>> inconsistent behaviour depending on whether posted interrupts are >>> available or not. If they are, KVM will bypass the eventfd completely >>> and deliver interrupts directly to the appropriate vCPU. If not, events >>> are delivered through the eventfd and userspace will receive them when >>> polling on the eventfd. >>> >>> By using add_wait_queue_priority(), KVM will be able to consistently >>> consume events within the kernel without accidentally exposing them >>> to userspace when they're supposed to be bypassed. This, in turn, means >>> that userspace doesn't have to jump through hoops to avoid listening >>> on the erroneously noisy eventfd and injecting duplicate interrupts. >>> >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> >> >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> > > Thanks. Paolo, the conclusion was that you were going to take this set > through the KVM tree, wasn't it? > Yes. Paolo
On 04/11/20 10:35, David Woodhouse wrote: > On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote: >> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote: >>> From: David Woodhouse <dwmw@amazon.co.uk> >>> >>> This allows an exclusive wait_queue_entry to be added at the head of the >>> queue, instead of the tail as normal. Thus, it gets to consume events >>> first without allowing non-exclusive waiters to be woken at all. >>> >>> The (first) intended use is for KVM IRQFD, which currently has >>> inconsistent behaviour depending on whether posted interrupts are >>> available or not. If they are, KVM will bypass the eventfd completely >>> and deliver interrupts directly to the appropriate vCPU. If not, events >>> are delivered through the eventfd and userspace will receive them when >>> polling on the eventfd. >>> >>> By using add_wait_queue_priority(), KVM will be able to consistently >>> consume events within the kernel without accidentally exposing them >>> to userspace when they're supposed to be bypassed. This, in turn, means >>> that userspace doesn't have to jump through hoops to avoid listening >>> on the erroneously noisy eventfd and injecting duplicate interrupts. >>> >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> >> >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> > > Thanks. Paolo, the conclusion was that you were going to take this set > through the KVM tree, wasn't it? > Queued, except for patch 2/3 in the eventfd series which Alex hasn't reviewed/acked yet. Paolo
On Fri, 6 Nov 2020 11:17:21 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote: > On 04/11/20 10:35, David Woodhouse wrote: > > On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote: > >> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote: > >>> From: David Woodhouse <dwmw@amazon.co.uk> > >>> > >>> This allows an exclusive wait_queue_entry to be added at the head of the > >>> queue, instead of the tail as normal. Thus, it gets to consume events > >>> first without allowing non-exclusive waiters to be woken at all. > >>> > >>> The (first) intended use is for KVM IRQFD, which currently has > >>> inconsistent behaviour depending on whether posted interrupts are > >>> available or not. If they are, KVM will bypass the eventfd completely > >>> and deliver interrupts directly to the appropriate vCPU. If not, events > >>> are delivered through the eventfd and userspace will receive them when > >>> polling on the eventfd. > >>> > >>> By using add_wait_queue_priority(), KVM will be able to consistently > >>> consume events within the kernel without accidentally exposing them > >>> to userspace when they're supposed to be bypassed. This, in turn, means > >>> that userspace doesn't have to jump through hoops to avoid listening > >>> on the erroneously noisy eventfd and injecting duplicate interrupts. > >>> > >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > >> > >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> > > > > Thanks. Paolo, the conclusion was that you were going to take this set > > through the KVM tree, wasn't it? > > > > Queued, except for patch 2/3 in the eventfd series which Alex hasn't > reviewed/acked yet. There was no vfio patch here, nor mention why it got dropped in v2 afaict. Thanks, Alex
On 6 November 2020 16:32:00 GMT, Alex Williamson <alex.williamson@redhat.com> wrote: >On Fri, 6 Nov 2020 11:17:21 +0100 >Paolo Bonzini <pbonzini@redhat.com> wrote: > >> On 04/11/20 10:35, David Woodhouse wrote: >> > On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote: >> >> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote: >> >>> From: David Woodhouse <dwmw@amazon.co.uk> >> >>> >> >>> This allows an exclusive wait_queue_entry to be added at the head >of the >> >>> queue, instead of the tail as normal. Thus, it gets to consume >events >> >>> first without allowing non-exclusive waiters to be woken at all. >> >>> >> >>> The (first) intended use is for KVM IRQFD, which currently has >> >>> inconsistent behaviour depending on whether posted interrupts are >> >>> available or not. If they are, KVM will bypass the eventfd >completely >> >>> and deliver interrupts directly to the appropriate vCPU. If not, >events >> >>> are delivered through the eventfd and userspace will receive them >when >> >>> polling on the eventfd. >> >>> >> >>> By using add_wait_queue_priority(), KVM will be able to >consistently >> >>> consume events within the kernel without accidentally exposing >them >> >>> to userspace when they're supposed to be bypassed. This, in turn, >means >> >>> that userspace doesn't have to jump through hoops to avoid >listening >> >>> on the erroneously noisy eventfd and injecting duplicate >interrupts. >> >>> >> >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> >> >> >> >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> >> > >> > Thanks. Paolo, the conclusion was that you were going to take this >set >> > through the KVM tree, wasn't it? >> > >> >> Queued, except for patch 2/3 in the eventfd series which Alex hasn't >> reviewed/acked yet. > >There was no vfio patch here, nor mention why it got dropped in v2 >afaict. Thanks, That was a different (but related) series. The VFIO one is https://patchwork.kernel.org/project/kvm/patch/20201027135523.646811-3-dwmw2@infradead.org/ Thanks.
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27fb99cfeb02..fe10e8570a52 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
 #define WQ_FLAG_BOOKMARK	0x04
 #define WQ_FLAG_CUSTOM		0x08
 #define WQ_FLAG_DONE		0x10
+#define WQ_FLAG_PRIORITY	0x20

 /*
  * A single wait-queue entry structure:
@@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)

 extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);

 static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
-	list_add(&wq_entry->entry, &wq_head->head);
+	struct list_head *head = &wq_head->head;
+	struct wait_queue_entry *wq;
+
+	list_for_each_entry(wq, &wq_head->head, entry) {
+		if (!(wq->flags & WQ_FLAG_PRIORITY))
+			break;
+		head = &wq->entry;
+	}
+	list_add(&wq_entry->entry, head);
 }

 /*
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 01f5d3020589..183cc6ae68a6 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
 }
 EXPORT_SYMBOL(add_wait_queue_exclusive);

+void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+{
+	unsigned long flags;
+
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
+	spin_lock_irqsave(&wq_head->lock, flags);
+	__add_wait_queue(wq_head, wq_entry);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_wait_queue_priority);
+
 void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
@@ -57,7 +68,11 @@ EXPORT_SYMBOL(remove_wait_queue);
 /*
  * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
  * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
- * number) then we wake all the non-exclusive tasks and one exclusive task.
+ * number) then we wake that number of exclusive tasks, and potentially all
+ * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
+ * the list and any non-exclusive tasks will be woken first. A priority task
+ * may be at the head of the list, and can consume the event without any other
+ * tasks being woken.
  *
  * There are circumstances in which we can try to wake a task which has already
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
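[To show how an in-kernel consumer might plug into the new API, here is a hedged sketch loosely modelled on the shape of KVM's irqfd arrangement. The my_* names and the exact registration flow are illustrative assumptions, not code from this series:]

	#include <linux/wait.h>
	#include <linux/poll.h>
	#include <linux/fs.h>

	/* Illustrative consumer: names are hypothetical, not from the patch. */
	struct my_consumer {
		wait_queue_entry_t wait;
		poll_table pt;
	};

	static int my_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
	{
		struct my_consumer *c = container_of(wait, struct my_consumer, wait);
		__poll_t events = key_to_poll(key);

		if (events & EPOLLIN) {
			/* Consume the event entirely in the kernel,
			 * e.g. inject the interrupt directly. */
		}

		/*
		 * For an eventfd-style wakeup (nr_exclusive == 1), returning
		 * nonzero from this exclusive priority waiter stops the walk,
		 * so non-exclusive waiters such as userspace poll/epoll are
		 * never woken for this event.
		 */
		return 1;
	}

	static void my_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
					 poll_table *pt)
	{
		struct my_consumer *c = container_of(pt, struct my_consumer, pt);

		/* Front of the queue, ahead of any non-exclusive pollers. */
		add_wait_queue_priority(wqh, &c->wait);
	}

	static void my_consumer_register(struct my_consumer *c, struct file *eventfd_file)
	{
		init_waitqueue_func_entry(&c->wait, my_wakeup);
		init_poll_funcptr(&c->pt, my_ptable_queue_proc);
		vfs_poll(eventfd_file, &c->pt);	/* hooks c->wait onto the eventfd waitqueue */
	}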