[2/4] swait: add the missing killable swaits

Message ID	20170614222017.14653-3-mcgrof@kernel.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A3B28239CF From: "Luis R. Rodriguez" <mcgrof@kernel.org> To: gregkh@linuxfoundation.org Cc: mfuzzey@parkeon.com, ebiederm@xmission.com, dmitry.torokhov@gmail.com, wagi@monom.org, dwmw2@infradead.org, jewalt@lgsinnovations.com, rafal@milecki.pl, arend.vanspriel@broadcom.com, rjw@rjwysocki.net, yi1.li@linux.intel.com, atull@kernel.org, moritz.fischer@ettus.com, pmladek@suse.com, johannes.berg@intel.com, emmanuel.grumbach@intel.com, luciano.coelho@intel.com, kvalo@codeaurora.org, luto@kernel.org, torvalds@linux-foundation.org, keescook@chromium.org, takahiro.akashi@linaro.org, dhowells@redhat.com, pjones@redhat.com, hdegoede@redhat.com, alan@linux.intel.com, tytso@mit.edu, mtk.manpages@gmail.com, paul.gortmaker@windriver.com, mtosatti@redhat.com, mawilcox@microsoft.com, linux-api@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, "Luis R. Rodriguez" <mcgrof@kernel.org>, "stable # 4 . 6" <stable@vger.kernel.org> Subject: [PATCH 2/4] swait: add the missing killable swaits Date: Wed, 14 Jun 2017 15:20:15 -0700 Message-Id: <20170614222017.14653-3-mcgrof@kernel.org> In-Reply-To: <20170614222017.14653-1-mcgrof@kernel.org> References: <20170614222017.14653-1-mcgrof@kernel.org> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Luis Chamberlain June 14, 2017, 10:20 p.m. UTC

Code in kernel which incorrectly used the non-killable variants could
end up having waits killed improperly. The respective killable waits
have been upstream for a while:

  o wait_for_completion_killable()
  o wait_for_completion_killable_timeout()

swait has been upstream since v4.6. Older kernels have had the
above variants in place for a long time.

Cc: stable <stable@vger.kernel.org> # 4.6
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
---
 include/linux/swait.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

Greg Kroah-Hartman June 29, 2017, 12:54 p.m. UTC | #1

On Wed, Jun 14, 2017 at 03:20:15PM -0700, Luis R. Rodriguez wrote:
> Code in kernel which incorrectly used the non-killable variants could
> end up having waits killed improperly. The respective killable waits
> have been upstream for a while:
> 
>   o wait_for_completion_killable()
>   o wait_for_completion_killable_timeout()
> 
> swait has been upstream since v4.6. Older kernels have had the
> above variants in place for a long time.
> 
> Cc: stable <stable@vger.kernel.org> # 4.6
> Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
> ---
>  include/linux/swait.h | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/include/linux/swait.h b/include/linux/swait.h
> index c1f9c62a8a50..2c700694d50a 100644
> --- a/include/linux/swait.h
> +++ b/include/linux/swait.h
> @@ -169,4 +169,29 @@ do {									\
>  	__ret;								\
>  })
>  
> +#define __swait_event_killable(wq, condition)				\
> +	___swait_event(wq, condition, TASK_KILLABLE, 0, schedule())
> +
> +#define swait_event_killable(wq, condition)				\
> +({									\
> +	int __ret = 0;							\
> +	if (!(condition))						\
> +		__ret = __swait_event_killable(wq, condition);		\
> +	__ret;								\
> +})
> +
> +#define __swait_event_killable_timeout(wq, condition, timeout)		\
> +	___swait_event(wq, ___wait_cond_timeout(condition),		\
> +		       TASK_KILLABLE, timeout,				\
> +		       __ret = schedule_timeout(__ret))
> +
> +#define swait_event_killable_timeout(wq, condition, timeout)		\
> +({									\
> +	long __ret = timeout;						\
> +	if (!___wait_cond_timeout(condition))				\
> +		__ret = __swait_event_killable_timeout(wq,		\
> +						condition, timeout);	\
> +	__ret;								\
> +})
> +
>  #endif /* _LINUX_SWAIT_H */

Do you really still want to add these, now that we know we shouldn't be
using swait in "real" code? :)

thanks,

greg k-h

Thomas Gleixner June 29, 2017, 1:05 p.m. UTC | #2

On Thu, 29 Jun 2017, Greg KH wrote:
> On Wed, Jun 14, 2017 at 03:20:15PM -0700, Luis R. Rodriguez wrote:
> Do you really still want to add these, now that we know we shouldn't be
> using swait in "real" code? :)

And who defined that it should not be used in real code?

Thanks,

	tglx

Greg Kroah-Hartman June 29, 2017, 1:35 p.m. UTC | #3

On Thu, Jun 29, 2017 at 03:05:26PM +0200, Thomas Gleixner wrote:
> On Thu, 29 Jun 2017, Greg KH wrote:
> > On Wed, Jun 14, 2017 at 03:20:15PM -0700, Luis R. Rodriguez wrote:
> > Do you really still want to add these, now that we know we shouldn't be
> > using swait in "real" code? :)
> 
> And who defined that it should not be used in real code?

Linus did, in a different firmware thread.  You have to _really_ know
what you are doing to use this interface, and the firmware interface
shouldn't be using it.  So adding new apis just for firmware does not
seem like a wise decision at this point in time.

thanks,

greg k-h

Thomas Gleixner June 29, 2017, 1:46 p.m. UTC | #4

On Thu, 29 Jun 2017, Greg KH wrote:
> On Thu, Jun 29, 2017 at 03:05:26PM +0200, Thomas Gleixner wrote:
> > On Thu, 29 Jun 2017, Greg KH wrote:
> > > On Wed, Jun 14, 2017 at 03:20:15PM -0700, Luis R. Rodriguez wrote:
> > > Do you really still want to add these, now that we know we shouldn't be
> > > using swait in "real" code? :)
> > 
> > And who defined that it should not be used in real code?
> 
> Linus did, in a different firmware thread.  You have to _really_ know
> what you are doing to use this interface, and the firmware interface
> shouldn't be using it.  So adding new apis just for firmware does not
> seem like a wise decision at this point in time.

So it's not about code in general, it's about a particular piece of
code. Fair enough.

Thanks,

	tglx

Linus Torvalds June 29, 2017, 4:13 p.m. UTC | #5

On Thu, Jun 29, 2017 at 6:46 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 29 Jun 2017, Greg KH wrote:
>> On Thu, Jun 29, 2017 at 03:05:26PM +0200, Thomas Gleixner wrote:
>> >
>> > And who defined that it should not be used in real code?
>>
>> Linus did, in a different firmware thread.  You have to _really_ know
>> what you are doing to use this interface, and the firmware interface
>> shouldn't be using it.  So adding new apis just for firmware does not
>> seem like a wise decision at this point in time.
>
> So it's not about code in general, it's about a particular piece of
> code. Fair enough.

Well, I'd actually say it the other way around: swait should not be
used in general, only in _very_ particular pieces of code that
actually explicitly need the odd swait semantics.

swait uses special locking and has odd semantics that are not at all
the same as the default wait queue ones. It should not be used without
very strong reasons (and honestly, the only strong enough reason seems
to be "RT").

The special locking means that swait doesn't do DEBUG_LOCK_ALLOC, but
it also means that it doesn't even work in all contexts.

So "swake_up()" does surprising things (only wake up one - that's what
caused a firmware loading bug), and "swake_up_all()" has magic rules
about interrupt disabling.

The thing is simply a collection of small hacks and should NOT be used
in general.

I never want to see a driver use that code, for example. It was
designed for RCU and RT, and it should damn well be limited to that.

              Linus
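
The wake-one versus wake-all difference described above can be illustrated
with a small userspace model. This is only a sketch, not kernel code: the
toy_* names are hypothetical, the queue is a plain array, and nothing here
sleeps or takes locks. It assumes one completion event and counts how many
queued waiters stay asleep afterwards.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_WAITERS 8

/* Toy, single-threaded model of a wait queue: a FIFO of sleeper IDs. */
struct toy_waitq {
	int sleepers[TOY_MAX_WAITERS];
	size_t count;
};

static void toy_enqueue(struct toy_waitq *q, int id)
{
	assert(q->count < TOY_MAX_WAITERS);
	q->sleepers[q->count++] = id;
}

/* Models the wake-one behaviour: wake at most the oldest waiter. */
static size_t toy_wake_one(struct toy_waitq *q)
{
	if (q->count == 0)
		return 0;
	for (size_t i = 1; i < q->count; i++)
		q->sleepers[i - 1] = q->sleepers[i];
	q->count--;
	return 1;
}

/* Models the wake-all behaviour: every queued waiter is woken. */
static size_t toy_wake_all(struct toy_waitq *q)
{
	size_t woken = q->count;

	q->count = 0;
	return woken;
}

/* Queue nwaiters sleepers, deliver one event, return how many still sleep. */
static size_t toy_still_asleep(size_t nwaiters, int use_wake_all)
{
	struct toy_waitq q = { .count = 0 };

	for (size_t i = 0; i < nwaiters; i++)
		toy_enqueue(&q, (int)i);

	if (use_wake_all)
		toy_wake_all(&q);
	else
		toy_wake_one(&q);

	return q.count;
}
```

With three waiters and a single wake-one call, two waiters remain queued in
this model, which is the shape of the firmware loading bug mentioned above.
With exactly one waiter the two calls are indistinguishable.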

Matthew Wilcox June 29, 2017, 4:31 p.m. UTC | #6

Linus Torvalds wrote:
> The thing is simply a collection of small hacks and should NOT be used
> in general.
> 
> I never want to see a driver use that code, for example. It was
> designed for RCU and RT, and it should damn well be limited to that.

Maybe put #ifndef MODULE around the entire include file then?
And definitely remove this line:

* One would recommend using this wait queue where possible.

Luis Chamberlain June 29, 2017, 5:29 p.m. UTC | #7

On Thu, Jun 29, 2017 at 04:31:08PM +0000, Matthew Wilcox wrote:
> Linus Torvalds wrote:
> > The thing is simply a collection of small hacks and should NOT be used
> > in general.
> > 
> > I never want to see a driver use that code, for example. It was
> > designed for RCU and RT, and it should damn well be limited to that.
> 
> Maybe put #ifndef MODULE around the entire include file then?
> And definitely remove this line:
> 
> * One would recommend using this wait queue where possible.

I'll respin and port away from swait, and add the #ifndef MODULE and special
note on swait.h.

  Luis

Davidlohr Bueso June 29, 2017, 5:40 p.m. UTC | #8

On Thu, 29 Jun 2017, Linus Torvalds wrote:

>Well, I'd actually say it the other way around: swait should not be
>used in general, only in _very_ particular pieces of code that
>actually explicitly need the odd swait semantics.
>
>swait uses special locking and has odd semantics that are not at all
>the same as the default wait queue ones. It should not be used without
>very strong reasons (and honestly, the only strong enough reason seems
>to be "RT").
>
>The special locking means that swait doesn't do DEBUG_LOCK_ALLOC, but
>it also means that it doesn't even work in all contexts.
>
>So "swake_up()" does surprising things (only wake up one - that's what
>caused a firmware loading bug), and "swake_up_all()" has magic rules
>about interrupt disabling.
>
>The thing is simply a collection of small hacks and should NOT be used
>in general.

For all the above, what do you think of my 'sswait' proposal?

Thanks,
Davidlohr

Linus Torvalds June 29, 2017, 5:57 p.m. UTC | #9

On Thu, Jun 29, 2017 at 10:40 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> For all the above, what do you think of my 'sswait' proposal?

I see no actual users of such a specialty interface.

I don't think we've *ever* had any actual problems with our current
wait-queues, apart from the RT issues, which were not about the
waitqueues themselves, but purely about RT itself.

So without some very compelling reason, I'd not want to add yet
another wait-queue.

I actually think swait is pure garbage. Most users only wake up one
process anyway, and using swait for that is stupid. If you only wake
up one, you might as well just have a single process pointer, not a
wait list at all, and then use "wake_up_process()".

There is *one* single user of swake_up_all(), and that one looks like
bogus crap also: it does it outside of the spinlock that could have
been used to protect the queue - plus I'm not sure there's really a
queue anyway, since I think it's just the grace-period kthread that is
there.

kvm uses swait, but doesn't use swake_up_all(), so it always just
wakes up a single waiter. That may be the right thing to do. Or it's
just another bug. I don't know. The KVM use looks broken too, since it
does

        if (swait_active(wqp)) {
                swake_up(wqp);

which is racy wrt new waiters, which implies that there is some
upper-level synchronization - possibly the same "only one thread
anyway".

So swake really looks like crap. It has crap semantics, it has crap
users, it's just broken.

The last thing we want to do is to create something _else_ specialized
like this.

                Linus

Davidlohr Bueso June 29, 2017, 6:33 p.m. UTC | #10

On Thu, 29 Jun 2017, Linus Torvalds wrote:

>So without some very compelling reason, I'd not want to add yet
>another wait-queue.

Yes, I was expecting this and very much agree.

I'll actually take a look at wake_q for wake_up_all() and co. to see if
we can reduce the spinlock hold times. Of course it would only make sense
for more than one wakeup.

>I actually think swait is pure garbage. Most users only wake up one
>process anyway, and using swait for that is stupid. If you only wake
>up one, you might as well just have a single process pointer, not a
>wait list at all, and then use "wake_up_process()".

But you still need the notion of a queue, even if you wake one task
at a time... I'm probably missing your point here.

>There is *one* single user of swake_up_all(), and that one looks like
>bogus crap also: it does it outside of the spinlock that could have
>been used to protect the queue - plus I'm not sure there's really a
>queue anyway, since I think it's just the grace-period kthread that is
>there.

So those cases when there's only one waiter I completely agree should
not be using waitqueues. pcpu-rwsems in the past suffered from this.

Thanks,
Davidlohr

Linus Torvalds June 29, 2017, 6:59 p.m. UTC | #11

On Thu, Jun 29, 2017 at 11:33 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> On Thu, 29 Jun 2017, Linus Torvalds wrote:
>
>> I actually think swait is pure garbage. Most users only wake up one
>> process anyway, and using swait for that is stupid. If you only wake
>> up one, you might as well just have a single process pointer, not a
>> wait list at all, and then use "wake_up_process()".
>
> But you still need the notion of a queue, even if you wake one task
> at a time... I'm probably missing your point here.

The *reason* they wake up only one seems to be that there really is
just one. It's some per-cpu idle thread for kvm, and for RCU it's the
RCU workqueue thread.

So the queue literally looks suspiciously pointless.

But I might be wrong, and there can actually be multiple entries. If
there are, I don't see why the wake-up-one semantics the code uses
would be valid, though.

                   Linus

Marcelo Tosatti June 29, 2017, 7:15 p.m. UTC | #12

On Thu, Jun 29, 2017 at 09:13:29AM -0700, Linus Torvalds wrote:
> On Thu, Jun 29, 2017 at 6:46 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Thu, 29 Jun 2017, Greg KH wrote:
> >> On Thu, Jun 29, 2017 at 03:05:26PM +0200, Thomas Gleixner wrote:
> >> >
> >> > And who defined that it should not be used in real code?
> >>
> >> Linus did, in a different firmware thread.  You have to _really_ know
> >> what you are doing to use this interface, and the firmware interface
> >> shouldn't be using it.  So adding new apis just for firmware does not
> >> seem like a wise decision at this point in time.
> >
> > So it's not about code in general, it's about a particular piece of
> > code. Fair enough.
> 
> Well, I'd actually say it the other way around: swait should not be
> used in general, only in _very_ particular pieces of code that
> actually explicitly need the odd swait semantics.
> 
> swait uses special locking and has odd semantics that are not at all
> the same as the default wait queue ones. It should not be used without
> very strong reasons (and honestly, the only strong enough reason seems
> to be "RT").

Performance shortcut:

https://lkml.org/lkml/2016/2/25/301

> The special locking means that swait doesn't do DEBUG_LOCK_ALLOC, but
> it also means that it doesn't even work in all contexts.
> 
> So "swake_up()" does surprising things (only wake up one - that's what
> caused a firmware loading bug), and "swake_up_all()" has magic rules
> about interrupt disabling.
> 
> The thing is simply a collection of small hacks and should NOT be used
> in general.

It's a very smart performance speed up ;-)

> I never want to see a driver use that code, for example. It was
> designed for RCU and RT, and it should damn well be limited to that.
> 
>               Linus

If KVM is the only user, feel free to remove it, you're past the point
where that performance improvement matters (due to VMX hardware
improvements).

Luis Chamberlain June 29, 2017, 7:40 p.m. UTC | #13

On Thu, Jun 29, 2017 at 11:59:29AM -0700, Linus Torvalds wrote:
> On Thu, Jun 29, 2017 at 11:33 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> > On Thu, 29 Jun 2017, Linus Torvalds wrote:
> >
> >> I actually think swait is pure garbage. Most users only wake up one
> >> process anyway, and using swait for that is stupid. If you only wake
> >> up one, you might as well just have a single process pointer, not a
> >> wait list at all, and then use "wake_up_process()".
> >
> > But you still need the notion of a queue, even if you wake one task
> > at a time... I'm probably missing your point here.
> 
> The *reason* they wake up only one seems to be that there really is
> just one. It's some per-cpu idle thread for kvm, and for RCU it's the
> RCU workqueue thread.
> 
> So the queue literally looks suspiciously pointless.
> 
> But I might be wrong, and there can actually be multiple entries.

Since this swake_up() --> swake_up_all() change reportedly *fixed* the one wake up
issue it would seem this does queue [0]. That said, I don't see any simple tests
in tools/testing/selftests/swait but then again we don't have tests for regular
waits either...

[0] https://bugzilla.kernel.org/show_bug.cgi?id=195477

> If there are, I don't see why the wake-up-one semantics the code uses
> would be valid, though.

Not sure what's wrong with it?

I believe one use case, for example, is when we know that the waker alone would
be able to ensure the next item in the queue will also be woken up. Such was the
case for the kmod.c conversion I tested, and behold it seemed to have worked
with just swake_up(). It's obviously *fragile* though, given you *assume* error
cases also wake up. In the case of kmod.c we have no such error cases but in
firmware_class.c we *do*, and actually that is part of the next set of fixes I
have to address, but that issue would be present even if we move to wait
for completion and complete_all() is used.

  Luis

Luis Chamberlain June 29, 2017, 7:44 p.m. UTC | #14

On Thu, Jun 29, 2017 at 09:40:15PM +0200, Luis R. Rodriguez wrote:
> On Thu, Jun 29, 2017 at 11:59:29AM -0700, Linus Torvalds wrote:
> > On Thu, Jun 29, 2017 at 11:33 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> > > On Thu, 29 Jun 2017, Linus Torvalds wrote:
> > >
> > >> I actually think swait is pure garbage. Most users only wake up one
> > >> process anyway, and using swait for that is stupid. If you only wake
> > >> up one, you might as well just have a single process pointer, not a
> > >> wait list at all, and then use "wake_up_process()".
> > >
> > > But you still need the notion of a queue, even if you wake one task
> > > at a time... I'm probably missing your point here.
> > 
> > The *reason* they wake up only one seems to be that there really is
> > just one. It's some per-cpu idle thread for kvm, and for RCU it's the
> > RCU workqueue thread.
> > 
> > So the queue literally looks suspiciously pointless.
> > 
> > But I might be wrong, and there can actually be multiple entries.
> 
> Since this swake_up() --> swake_up_all() reportedly *fixed* the one wake up
> issue it would seem this does queue [0]. That said, I don't see any simple tests
> tools/testing/selftests/swait but then again we don't have test for regular
> waits either...
> 
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=195477

I should also note that the swake_up_all() should have only helped in cases where
3 cards were used, as if only 2 were used that should have been covered by just
the swake_up(). Unless of course I hear otherwise from the reporter, Nicolas, or
from Jakub.

  Luis

Linus Torvalds June 29, 2017, 8:57 p.m. UTC | #15

On Thu, Jun 29, 2017 at 12:40 PM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> On Thu, Jun 29, 2017 at 11:59:29AM -0700, Linus Torvalds wrote:
>> > at a time... I'm probably missing your point here.
>>
>> The *reason* they wake up only one seems to be that there really is
>> just one. It's some per-cpu idle thread for kvm, and for RCU it's the
>> RCU workqueue thread.
>>
>> So the queue literally looks suspiciously pointless.
>>
>> But I might be wrong, and there can actually be multiple entries.
>
> Since this swake_up() --> swake_up_all() reportedly *fixed* the one wake up
> issue it would seem this does queue [0].

I'm not talking about the firmware code.

That thing never had an excuse to use swait in the first place.

I'm talking about kvm and rcu, which *do* have excuses to use it, but
where I argue that swait is _still_ a questionable interface for other
reasons.

                Linus

Jakub Kicinski June 29, 2017, 8:58 p.m. UTC | #16

On Thu, 29 Jun 2017 21:44:55 +0200, Luis R. Rodriguez wrote:
> On Thu, Jun 29, 2017 at 09:40:15PM +0200, Luis R. Rodriguez wrote:
> > On Thu, Jun 29, 2017 at 11:59:29AM -0700, Linus Torvalds wrote:  
> > > On Thu, Jun 29, 2017 at 11:33 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:  
> > > > On Thu, 29 Jun 2017, Linus Torvalds wrote:
> > > >  
> > > >> I actually think swait is pure garbage. Most users only wake up one
> > > >> process anyway, and using swait for that is stupid. If you only wake
> > > >> up one, you might as well just have a single process pointer, not a
> > > >> wait list at all, and then use "wake_up_process()".  
> > > >
> > > > But you still need the notion of a queue, even if you wake one task
> > > > at a time... I'm probably missing your point here.  
> > > 
> > > The *reason* they wake up only one seems to be that there really is
> > > just one. It's some per-cpu idle thread for kvm, and for RCU it's the
> > > RCU workqueue thread.
> > > 
> > > So the queue literally looks suspiciously pointless.
> > > 
> > > But I might be wrong, and there can actually be multiple entries.  
> > 
> > Since this swake_up() --> swake_up_all() reportedly *fixed* the one wake up
> > issue it would seem this does queue [0]. That said, I don't see any simple tests
> > tools/testing/selftests/swait but then again we don't have test for regular
> > waits either...
> > 
> > [0] https://bugzilla.kernel.org/show_bug.cgi?id=195477  
> 
> I should also note that the swake_up_all() should have only helped in cases where
> 3 cards were used, as if only 2 were used that should have been covered by just
> the swake_up(). Unless of course I hear otherwise by the reporter, Nicolas or
> from Jakub.

I was hitting this with 2 cards.

Linus Torvalds June 30, 2017, 4:03 a.m. UTC | #17

On Thu, Jun 29, 2017 at 12:15 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Thu, Jun 29, 2017 at 09:13:29AM -0700, Linus Torvalds wrote:
>>
>> swait uses special locking and has odd semantics that are not at all
>> the same as the default wait queue ones. It should not be used without
>> very strong reasons (and honestly, the only strong enough reason seems
>> to be "RT").
>
> Performance shortcut:
>
> https://lkml.org/lkml/2016/2/25/301

Yes, I know why kvm uses it, I just don't think it's necessarily the
right thing.

That kvm commit is actually a great example: it uses swake_up() from
an interrupt, and that's in fact the *reason* it uses swake_up().

But that also fundamentally means that it cannot use swake_up_all(),
so it basically *relies* on there only ever being one single entry
that needs to be woken up.

And as far as I can tell, it really is because the queue only ever has
one entry (ie it's per-vcpu, and when the vcpu is blocked, it's
blocked - so no other user will be waiting there).

So it isn't that you might queue multiple entries and then just wake
them up one at a time. There really is just one entry at a time,
right?

And that means that swait is actually completely the wrong thing to
do. It's more expensive and more complex than just saving the single
process pointer away and just doing "wake_up_process()".

Now, it really is entirely possible that I'm missing something, but it
does look like that to me.

We've had wake_up_process() since pretty much day #1. THAT is the
fastest and simplest direct wake-up there is, not some "simple
wait-queue".

Now, admittedly I don't know the code and really may be entirely off,
but looking at the commit (no need to go to the lkml archives - it's
commit 8577370fb0cb ("KVM: Use simple waitqueue for vcpu->wq") in
mainline), I really think the swait() use is simply not correct if
there can be multiple waiters, exactly because swake_up() only wakes
up a single entry.

So either there is only a single entry, or *all* the code like

        dvcpu->arch.wait = 0;

-       if (waitqueue_active(&dvcpu->wq))
-               wake_up_interruptible(&dvcpu->wq);
+       if (swait_active(&dvcpu->wq))
+               swake_up(&dvcpu->wq);

is simply wrong. If there are multiple blockers, and you just cleared
"arch.wait", I think they should *all* be woken up. And that's not
what swake_up() does.

So I think that kvm_vcpu_block() could easily have instead done

    vcpu->process = current;

as the "prepare_to_wait()" part, and "finish_wait()" would be to just
clear vcpu->process. No wait-queue, just a single pointer to the
single blocking thread.

(Of course, you still need serialization, so that
"wake_up_process(vcpu->process)" doesn't end up using a stale value,
but since processes are already freed with RCU because of other things
like that, the serialization is very low-cost, you only need to be
RCU-read safe when waking up).

See what I'm saying?

Note that "wake_up_process()" really is fairly widely used. It's
widely used because it's fairly obvious, and because that really *is*
the lowest-possible cost: a single pointer to the sleeping thread, and
you can often do almost no locking at all.

And unlike swake_up(), it's obvious that you only wake up a single thread.

           Linus
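
The single-pointer scheme described above can be sketched as a toy
userspace model. Everything here is an illustrative assumption: the toy_*
types and functions are hypothetical stand-ins, nothing actually sleeps,
and the serialization mentioned above (being RCU-read safe when waking up)
is deliberately left out. The sketch only shows the bookkeeping: one slot,
set before blocking, cleared afterwards, and a waker that wakes whoever is
in the slot.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { TOY_RUNNING, TOY_SLEEPING };

/* Hypothetical stand-in for a task. */
struct toy_task {
	enum toy_state state;
};

/* Hypothetical stand-in for a vcpu with a single sleeper slot. */
struct toy_vcpu {
	struct toy_task *sleeper;
};

/* The "prepare_to_wait()" part: record the one blocking task. */
static void toy_prepare_to_block(struct toy_vcpu *v, struct toy_task *self)
{
	assert(v->sleeper == NULL);	/* at most one blocker by construction */
	self->state = TOY_SLEEPING;
	v->sleeper = self;
}

/* The "finish_wait()" part: clear the slot. */
static void toy_finish_block(struct toy_vcpu *v)
{
	v->sleeper = NULL;
}

/* Models waking the recorded task; returns 1 if a sleeper was woken. */
static int toy_wake(struct toy_vcpu *v)
{
	struct toy_task *t = v->sleeper;

	if (t == NULL || t->state != TOY_SLEEPING)
		return 0;
	t->state = TOY_RUNNING;
	return 1;
}

/* Encodes the result of waking before, during and after a block as digits. */
static int toy_block_and_wake_demo(void)
{
	struct toy_task task = { TOY_RUNNING };
	struct toy_vcpu vcpu = { NULL };
	int before, during, after;

	before = toy_wake(&vcpu);		/* slot empty */
	toy_prepare_to_block(&vcpu, &task);
	during = toy_wake(&vcpu);		/* sleeper recorded */
	toy_finish_block(&vcpu);
	after = toy_wake(&vcpu);		/* slot cleared again */

	return before * 100 + during * 10 + after;
}
```

In this model only the wake attempt made while the task is recorded has any
effect, and it is obvious from the data structure that at most one thread
can ever be woken.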

Marcelo Tosatti June 30, 2017, 11:55 a.m. UTC | #18

On Thu, Jun 29, 2017 at 09:03:42PM -0700, Linus Torvalds wrote:
> On Thu, Jun 29, 2017 at 12:15 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Thu, Jun 29, 2017 at 09:13:29AM -0700, Linus Torvalds wrote:
> >>
> >> swait uses special locking and has odd semantics that are not at all
> >> the same as the default wait queue ones. It should not be used without
> >> very strong reasons (and honestly, the only strong enough reason seems
> >> to be "RT").
> >
> > Performance shortcut:
> >
> > https://lkml.org/lkml/2016/2/25/301
> 
> Yes, I know why kvm uses it, I just don't think it's necessarily the
> right thing.
> 
> That kvm commit is actually a great example: it uses swake_up() from
> an interrupt, and that's in fact the *reason* it uses swake_up().
> 
> But that also fundamentally means that it cannot use swake_up_all(),
> so it basically *relies* on there only ever being one single entry
> that needs to be woken up.
> 
> And as far as I can tell, it really is because the queue only ever has
> one entry (ie it's per-vcpu, and when the vcpu is blocked, it's
> blocked - so no other user will be waiting there).

Exactly.
> 
> So it isn't that you might queue multiple entries and then just wake
> them up one at a time. There really is just one entry at a time,
> right?

Yes.

> And that means that swait is actually completely the wrong thing to
> do. It's more expensive and more complex than just saving the single
> process pointer away and just doing "wake_up_process()".

Aha, i see.

> 
> Now, it really is entirely possible that I'm missing something, but it
> does look like that to me.

Just drop it -- the optimization is not relevant anymore given VMX
hardware improvements.

> We've had wake_up_process() since pretty much day #1. THAT is the
> fastest and simplest direct wake-up there is, not some "simple
> wait-queue".
> 
> Now, admittedly I don't know the code and really may be entirely off,
> but looking at the commit (no need to go to the lkml archives - it's
> commit 8577370fb0cb ("KVM: Use simple waitqueue for vcpu->wq") in
> mainline), I really think the swait() use is simply not correct if
> there can be multiple waiters, exactly because swake_up() only wakes
> up a single entry.

There can't be: it's one emulated LAPIC per vcpu. So only one vcpu
waits for that waitqueue.

> So either there is only a single entry, or *all* the code like
> 
>         dvcpu->arch.wait = 0;
> 
> -       if (waitqueue_active(&dvcpu->wq))
> -               wake_up_interruptible(&dvcpu->wq);
> +       if (swait_active(&dvcpu->wq))
> +               swake_up(&dvcpu->wq);
> 
> is simply wrong. If there are multiple blockers, and you just cleared
> "arch.wait", I think they should *all* be woken up. And that's not
> what swake_up() does.
> 
> So I think that kvm_vcpu_block() could easily have instead done
> 
>     vcpu->process = current;
> 
> as the "prepare_to_wait()" part, and "finish_wait()" would be to just
> clear vcpu->process. No wait-queue, just a single pointer to the
> single blocking thread.
> 
> (Of course, you still need serialization, so that
> "wake_up_process(vcpu->process)" doesn't end up using a stale value,
> but since processes are already freed with RCU because of other things
> like that, the serialization is very low-cost, you only need to be
> RCU-read safe when waking up).
> 
> See what I'm saying?
> 
> Note that "wake_up_process()" really is fairly widely used. It's
> widely used because it's fairly obvious, and because that really *is*
> the lowest-possible cost: a single pointer to the sleeping thread, and
> you can often do almost no locking at all.
> 
> And unlike swake_up(), it's obvious that you only wake up a single thread.
> 
>            Linus

Feel free to drop the KVM usage... agreed the interface is a special 
case and a generic one which handles multiple waiters 
and debugging etc should be preferred.

Not sure if other people are using it, though.

Marcelo Tosatti June 30, 2017, 11:57 a.m. UTC | #19

On Thu, Jun 29, 2017 at 09:03:42PM -0700, Linus Torvalds wrote:
> On Thu, Jun 29, 2017 at 12:15 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Thu, Jun 29, 2017 at 09:13:29AM -0700, Linus Torvalds wrote:
> >>
> >> swait uses special locking and has odd semantics that are not at all
> >> the same as the default wait queue ones. It should not be used without
> >> very strong reasons (and honestly, the only strong enough reason seems
> >> to be "RT").
> >
> > Performance shortcut:
> >
> > https://lkml.org/lkml/2016/2/25/301
> 
> Yes, I know why kvm uses it, I just don't think it's necessarily the
> right thing.
> 
> That kvm commit is actually a great example: it uses swake_up() from
> an interrupt, and that's in fact the *reason* it uses swake_up().
> 
> But that also fundamentally means that it cannot use swake_up_all(),
> so it basically *relies* on there only ever being one single entry
> that needs to be woken up.
> 
> And as far as I can tell, it really is because the queue only ever has
> one entry (ie it's per-vcpu, and when the vcpu is blocked, it's
> blocked - so no other user will be waiting there).

Exactly.
> 
> So it isn't that you might queue multiple entries and then just wake
> them up one at a time. There really is just one entry at a time,
> right?

Yes.

> And that means that swait is actually completely the wrong thing to
> do. It's more expensive and more complex than just saving the single
> process pointer away and just doing "wake_up_process()".

Aha, i see.

> 
> Now, it really is entirely possible that I'm missing something, but it
> does look like that to me.

Just drop it -- the optimization is not relevant anymore given VMX
hardware improvements.

> We've had wake_up_process() since pretty much day #1. THAT is the
> fastest and simplest direct wake-up there is, not some "simple
> wait-queue".
> 
> Now, admittedly I don't know the code and really may be entirely off,
> but looking at the commit (no need to go to the lkml archives - it's
> commit 8577370fb0cb ("KVM: Use simple waitqueue for vcpu->wq") in
> mainline), I really think the swait() use is simply not correct if
> there can be multiple waiters, exactly because swake_up() only wakes
> up a single entry.

There can't be: it's one emulated LAPIC per vcpu. So only one vcpu
waits for that waitqueue.

> So either there is only a single entry, or *all* the code like
> 
>         dvcpu->arch.wait = 0;
> 
> -       if (waitqueue_active(&dvcpu->wq))
> -               wake_up_interruptible(&dvcpu->wq);
> +       if (swait_active(&dvcpu->wq))
> +               swake_up(&dvcpu->wq);
> 
> is simply wrong. If there are multiple blockers, and you just cleared
> "arch.wait", I think they should *all* be woken up. And that's not
> what swake_up() does.
> 
> So I think that kvm_vcpu_block() could easily have instead done
> 
>     vcpu->process = current;
> 
> as the "prepare_to_wait()" part, and "finish_wait()" would be to just
> clear vcpu->process. No wait-queue, just a single pointer to the
> single blocking thread.
> 
> (Of course, you still need serialization, so that
> "wake_up_process(vcpu->process)" doesn't end up using a stale value,
> but since processes are already freed with RCU because of other things
> like that, the serialization is very low-cost, you only need to be
> RCU-read safe when waking up).
> 
> See what I'm saying?
> 
> Note that "wake_up_process()" really is fairly widely used. It's
> widely used because it's fairly obvious, and because that really *is*
> the lowest-possible cost: a single pointer to the sleeping thread, and
> you can often do almost no locking at all.
> 
> And unlike swake_up(), it's obvious that you only wake up a single thread.
> 
>            Linus

Feel free to drop the KVM usage... agreed the interface is a special
case, and a generic one which handles multiple waiters
and has debugging etc. should be preferred to avoid bugs.

Not sure if other people are using it (swait).

Krister Johansen June 30, 2017, 5:30 p.m. UTC | #20

On Thu, Jun 29, 2017 at 09:03:42PM -0700, Linus Torvalds wrote:
> On Thu, Jun 29, 2017 at 12:15 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Thu, Jun 29, 2017 at 09:13:29AM -0700, Linus Torvalds wrote:
> >>
> >> swait uses special locking and has odd semantics that are not at all
> >> the same as the default wait queue ones. It should not be used without
> >> very strong reasons (and honestly, the only strong enough reason seems
> >> to be "RT").
> >
> > Performance shortcut:
> >
> > https://lkml.org/lkml/2016/2/25/301
> 
> Now, admittedly I don't know the code and really may be entirely off,
> but looking at the commit (no need to go to the lkml archives - it's
> commit 8577370fb0cb ("KVM: Use simple waitqueue for vcpu->wq") in
> mainline), I really think the swait() use is simply not correct if
> there can be multiple waiters, exactly because swake_up() only wakes
> up a single entry.
> 
> So either there is only a single entry, or *all* the code like
> 
>         dvcpu->arch.wait = 0;
> 
> -       if (waitqueue_active(&dvcpu->wq))
> -               wake_up_interruptible(&dvcpu->wq);
> +       if (swait_active(&dvcpu->wq))
> +               swake_up(&dvcpu->wq);
> 
> is simply wrong. If there are multiple blockers, and you just cleared
> "arch.wait", I think they should *all* be woken up. And that's not
> what swake_up() does.

Code like this is probably wrong for another reason too.  The
swait_active() is likely redundant, since swake_up() also calls
swait_active().  The check in swake_up() returns if it thinks there are
no active waiters.  However, the synchronization needed to ensure a
proper wakeup is left as an exercise to swake_up's caller.

There have been a couple of other discussions around this topic
recently:

https://lkml.org/lkml/2017/5/25/722
https://lkml.org/lkml/2017/6/8/1222

The above is better written as the following, but even then you still
have the single/multiple wakeup problem:

 -       if (waitqueue_active(&dvcpu->wq))
 -               wake_up_interruptible(&dvcpu->wq);
 +       smp_mb();
 +       swake_up(&dvcpu->wq);


Just to add to the confusion, the last time I checked, the semantics of
swake_up() even differ between RT Linux and mainline, which makes this
even more confusing.

-K
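
Why the barrier matters can be shown with a toy interleaving model. This is
a sketch under strong assumptions, not a description of any real CPU or of
the kernel implementation: all names are hypothetical, each side performs
exactly two steps, and a "missing barrier" is modelled simply as the waker
doing its queue check before its condition store becomes visible. The
function enumerates every interleaving of the two sides and counts the ones
where the waiter goes to sleep and nobody wakes it.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_world {
	bool cond;	/* wakeup condition, e.g. the flag the waker sets */
	bool queued;	/* a waiter is on the queue (what the active check reads) */
	bool woke;	/* waker saw a waiter and issued a wakeup */
	bool slept;	/* waiter saw no condition and went to sleep */
};

static void waker_store(struct toy_world *w)    { w->cond = true; }
static void waker_check(struct toy_world *w)    { if (w->queued) w->woke = true; }
static void waiter_enqueue(struct toy_world *w) { w->queued = true; }
static void waiter_check(struct toy_world *w)   { if (!w->cond) w->slept = true; }

/* Count interleavings in which the waiter sleeps and is never woken. */
static int toy_lost_wakeups(bool waker_reordered)
{
	int lost = 0;

	/* Each mask picks which 2 of the 4 time slots belong to the waker. */
	for (unsigned mask = 0; mask < 16; mask++) {
		struct toy_world w = { false, false, false, false };
		int waker_step = 0, waiter_step = 0, bits = 0;

		for (int slot = 0; slot < 4; slot++)
			bits += (mask >> slot) & 1;
		if (bits != 2)
			continue;

		for (int slot = 0; slot < 4; slot++) {
			if ((mask >> slot) & 1) {
				bool first = (waker_step++ == 0);

				/* In order: store, then check. Reordered: check, then store. */
				if (first != waker_reordered)
					waker_store(&w);
				else
					waker_check(&w);
			} else {
				/* Waiter always enqueues first, then tests the condition. */
				if (waiter_step++ == 0)
					waiter_enqueue(&w);
				else
					waiter_check(&w);
			}
		}

		if (w.slept && !w.woke)
			lost++;
	}
	return lost;
}
```

In this model, keeping the waker's store ahead of its check leaves no
interleaving with a lost wakeup, while letting the check run first allows
one: the waker sees an empty queue, the waiter enqueues and sees the old
condition, and only then does the store land.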
