[GIT,PULL] pipe: nonblocking rw for io_uring

Message ID	20230421-seilbahn-vorpreschen-bd73ac3c88d7@brauner (mailing list archive)
State	Mainlined, archived
Headers	show Return-Path: <linux-fsdevel-owner@vger.kernel.org> From: Christian Brauner <brauner@kernel.org> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christian Brauner <brauner@kernel.org>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Jens Axboe <axboe@kernel.dk> Subject: [GIT PULL] pipe: nonblocking rw for io_uring Date: Fri, 21 Apr 2023 16:01:20 +0200 Message-Id: <20230421-seilbahn-vorpreschen-bd73ac3c88d7@brauner> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[GIT,PULL] pipe: nonblocking rw for io_uring \| expand [GIT,PULL] pipe: nonblocking rw for io_uring

Christian Brauner April 21, 2023, 2:01 p.m. UTC

Hey Linus,

/* Summary */
This contains Jens' work to support FMODE_NOWAIT and thus IOCB_NOWAIT
for pipes ensuring that all places can deal with non-blocking requests.

To this end, pass down the information that this is a nonblocking
request so that pipe locking, allocation, and buffer checking correctly
deal with those.

The series is small but it felt standalone enough that I didn't want to
lump it together with other, generic vfs work.

/* Testing */
clang: Ubuntu clang version 15.0.6
gcc: (Ubuntu 12.2.0-3ubuntu1) 12.2.0

All patches are based on 6.3-rc2 and have been sitting in linux-next.
No build failures or warnings were observed. All old and new tests in
fstests, selftests, and LTP pass without regressions.

/* Conflicts */
At the time of creating this PR no merge conflicts were reported from
linux-next and no merge conflicts showed up doing a test-merge with
current mainline.

The following changes since commit eeac8ede17557680855031c6f305ece2378af326:

  Linux 6.3-rc2 (2023-03-12 16:36:44 -0700)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.4/vfs.pipe

for you to fetch changes up to ec30adeb289d9054efae4e285b269438ce63fe03:

  pipe: set FMODE_NOWAIT on pipes (2023-03-15 11:37:29 -0600)

Please consider pulling these changes from the signed v6.4/vfs.pipe tag.

Thanks!
Christian

----------------------------------------------------------------
v6.4/vfs.pipe

----------------------------------------------------------------
Jens Axboe (3):
      fs: add 'nonblock' parameter to pipe_buf_confirm() and fops method
      pipe: enable handling of IOCB_NOWAIT
      pipe: set FMODE_NOWAIT on pipes

 fs/fuse/dev.c             |  4 ++--
 fs/pipe.c                 | 42 ++++++++++++++++++++++++++++++++++--------
 fs/splice.c               | 11 +++++++----
 include/linux/pipe_fs_i.h |  8 +++++---
 4 files changed, 48 insertions(+), 17 deletions(-)

Linus Torvalds April 24, 2023, 9:05 p.m. UTC | #1

On Fri, Apr 21, 2023 at 7:02 AM Christian Brauner <brauner@kernel.org> wrote:
>
> This contains Jens' work to support FMODE_NOWAIT and thus IOCB_NOWAIT
> for pipes ensuring that all places can deal with non-blocking requests.
>
> To this end, pass down the information that this is a nonblocking
> request so that pipe locking, allocation, and buffer checking correctly
> deal with those.

Ok, I pulled this, but then I unpulled it again.

Doing conditional locking for O_NONBLOCK and friends is not ok. Yes,
it's been done, and I may even have let some slip through, but it's
just WRONG.

There is absolutely no situation where a "ok, so the lock on this data
structure was taken, we'll go to some async model" is worth it.

Every single time I've seen this, it's been some developer who thinks
that O_NONBLOCk is somehow some absolute "this cannot schedule" thing.
And every single time it's been broken and horrid crap that just made
everything more complicated and slowed things down.

If some lock wait is a real problem, then the lock needs to be just
fixed. Not "ok, let's make a non-blocking version and fall back if
it's held".

Note that FMODE_NOWAIT does not mean (and *CANNOT* mean) that you'd
somehow be able to do the IO in some atomic context anyway. Many of
our kernel locks don't even support that (eg mutexes).

So thinking that FMODE_NOWAIT is that kind of absolute is the wrong
kind of thinking entirely.

FMODE_NOWAIT should mean that no *IO* gets done. And yes, that might
mean that allocations fail too. But not this kind of "let's turn
locking into 'trylock' stuff".

The whoe flag is misnamed. It should have been FMODE_NOIO, the same
way we have IOCB_NOIO.

If you want FMODE_ATOMIC, then that is something entirely and
completely different, and is probably crazy.

We have done it in one area (RCU pathname lookup), and it was worth it
there, and it was a *huge* undertaking. It was worth it, but it was
worth it because it was a serious thing with some serious design and a
critical area.

Not this kind of "conditional trylock" garbage which just means that
people will require 'poll()' to now add the lock to the waitqueue, or
make all callers go into some "let's use a different thread instead"
logic.

                             Linus

Jens Axboe April 24, 2023, 9:22 p.m. UTC | #2

On 4/24/23 3:05?PM, Linus Torvalds wrote:
> On Fri, Apr 21, 2023 at 7:02?AM Christian Brauner <brauner@kernel.org> wrote:
>>
>> This contains Jens' work to support FMODE_NOWAIT and thus IOCB_NOWAIT
>> for pipes ensuring that all places can deal with non-blocking requests.
>>
>> To this end, pass down the information that this is a nonblocking
>> request so that pipe locking, allocation, and buffer checking correctly
>> deal with those.
> 
> Ok, I pulled this, but then I unpulled it again.
> 
> Doing conditional locking for O_NONBLOCK and friends is not ok. Yes,
> it's been done, and I may even have let some slip through, but it's
> just WRONG.
> 
> There is absolutely no situation where a "ok, so the lock on this data
> structure was taken, we'll go to some async model" is worth it.
> 
> Every single time I've seen this, it's been some developer who thinks
> that O_NONBLOCk is somehow some absolute "this cannot schedule" thing.
> And every single time it's been broken and horrid crap that just made
> everything more complicated and slowed things down.
> 
> If some lock wait is a real problem, then the lock needs to be just
> fixed. Not "ok, let's make a non-blocking version and fall back if
> it's held".
> 
> Note that FMODE_NOWAIT does not mean (and *CANNOT* mean) that you'd
> somehow be able to do the IO in some atomic context anyway. Many of
> our kernel locks don't even support that (eg mutexes).
> 
> So thinking that FMODE_NOWAIT is that kind of absolute is the wrong
> kind of thinking entirely.
> 
> FMODE_NOWAIT should mean that no *IO* gets done. And yes, that might
> mean that allocations fail too. But not this kind of "let's turn
> locking into 'trylock' stuff".
> 
> The whoe flag is misnamed. It should have been FMODE_NOIO, the same
> way we have IOCB_NOIO.
> 
> If you want FMODE_ATOMIC, then that is something entirely and
> completely different, and is probably crazy.
> 
> We have done it in one area (RCU pathname lookup), and it was worth it
> there, and it was a *huge* undertaking. It was worth it, but it was
> worth it because it was a serious thing with some serious design and a
> critical area.
> 
> Not this kind of "conditional trylock" garbage which just means that
> people will require 'poll()' to now add the lock to the waitqueue, or
> make all callers go into some "let's use a different thread instead"
> logic.

I'm fully agreeing with you on the semantics for
FMODE_NOWAIT/IOCB_NOWAIT, and also agree that they are misnamed and
really should be _NOIO. There is no conceptual misunderstanding there,
nor do I care about atomic semantics. Obviously the above is for
io_uring to be able to improve the performance on pipes, because right
now everything is punted to io-wq and that severly hampers performance
with how pipes are generally used. io_uring doesn't care about atomic
context, but it very much cares about NOT sleeping for IO during issue
as that has adverse performance implications.

If we don't ever wait for IO with the pipe lock held, then we can skip
the conditional locking. But with splice, that's not at all the case! We
most certainly wait for IO there with the pipe lock held.

And yes I totally agree that conditional locking is not pretty, but for
some cases it really is a necessary evil. People have complained about
io_uring pipe performance, and I ran some testing myself. Being able to
sanely read/write from/to pipes without punting to io-wq is an easy 10x
improvement for that use case. How can I enable FMODE_NOWAIT on pipes if
we have splice holding the pipe lock over IO? It's not feasible.

So please reconsider, doing IO to/from pipes efficiently is actually
kind of useful for io_uring...

Linus Torvalds April 24, 2023, 9:37 p.m. UTC | #3

On Mon, Apr 24, 2023 at 2:22 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> If we don't ever wait for IO with the pipe lock held, then we can skip
> the conditional locking. But with splice, that's not at all the case! We
> most certainly wait for IO there with the pipe lock held.

I think that then needs to just be fixed.

I really think that trylock due to "nonblocking" IO is fundamentally
wrong. Thinking that you need it is just a sign of something else
being very wrong.

That "very wrong" thing might well be splice then not honoring
non-blocking IO on a non-blocking pipe.

And I completely refuse to add that trylock hack to paper that over.
The pipe lock is *not* meant for IO.

             Linus

Jens Axboe April 24, 2023, 9:55 p.m. UTC | #4

On 4/24/23 3:37?PM, Linus Torvalds wrote:
> On Mon, Apr 24, 2023 at 2:22?PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> If we don't ever wait for IO with the pipe lock held, then we can skip
>> the conditional locking. But with splice, that's not at all the case! We
>> most certainly wait for IO there with the pipe lock held.
> 
> I think that then needs to just be fixed.

I took another look at this, and the main issue is in fact splice
confirming buffers. So I do think that we can make this work by simply
having the non-block nature of it being passed down the ->confirm()
callback as that's the one that'll be waiting for IO. If we have that,
then we can disregard the pipe locking as we won't be holding it over
IO.

Which is what part of this series does, notably patch 1.

Only other oddity is pipe_to_sendpage(), which we can probably sanely
ignore.

IOW, would you be fine with a v2 of this pull request where patch 2
drops the conditional locking and just passes it to ->confirm()? That's
certainly sane, and just makes the ultimate page locking conditional to
avoid waiting on IO. I'd really hate to still be missing out on pipe
performance with io_uring.

Linus Torvalds April 24, 2023, 9:58 p.m. UTC | #5

On Mon, Apr 24, 2023 at 2:37 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I completely refuse to add that trylock hack to paper that over.
> The pipe lock is *not* meant for IO.

If you want to paper it over, do it other ways.

I'd love to just magically fix splice, but hey, that might not be possible.

But possible fixes papering this over might be to make splice "poison
a pipe, and make io_uring falls back on io workers only on pipes that
do splice. Make any normal pipe read/write load sane.

And no, don't worry about races. If you have the same pipe used for
io_uring IO *and* somebody else then doing splice on it and racing,
just take the loss and tell people that they might hit a slow case if
they do stupid things.

Basically, the patch might look like something like

 - do_pipe() sets FMODE_NOWAIT by default when creating a pipe

 - splice then clears FMODE_NOWAIT on pipes as they are used

and now io_uring sees whether the pipe is playing nice or not.

As far as I can tell, something like that would make the
'pipe_buf_confirm()' part unnecessary too, since that's only relevant
for splice.

A fancier version might be to only do that "splice then clears
FMODE_NOWAIT" thing if the other side of the splice has not set
FMODE_NOWAIT.

Honestly, if the problem is "pipe IO is slow", then splice should not
be the thing you optimize for.

                 Linus

Linus Torvalds April 24, 2023, 10 p.m. UTC | #6

On Mon, Apr 24, 2023 at 2:55 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> I took another look at this, and the main issue is in fact splice
> confirming buffers. So I do think that we can make this work by simply
> having the non-block nature of it being passed down the ->confirm()
> callback as that's the one that'll be waiting for IO. If we have that,
> then we can disregard the pipe locking as we won't be holding it over
> IO.

Ok, that part looks fine to me.

The pipe_buf_confirm() part of the series I don't find problematic,
it's really conditional locking that I absolutely detest and has
always been a sign of problems elsewhere.

             Linus

Jens Axboe April 24, 2023, 10:03 p.m. UTC | #7

On 4/24/23 3:55?PM, Jens Axboe wrote:
> On 4/24/23 3:37?PM, Linus Torvalds wrote:
>> On Mon, Apr 24, 2023 at 2:22?PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> If we don't ever wait for IO with the pipe lock held, then we can skip
>>> the conditional locking. But with splice, that's not at all the case! We
>>> most certainly wait for IO there with the pipe lock held.
>>
>> I think that then needs to just be fixed.
> 
> I took another look at this, and the main issue is in fact splice
> confirming buffers. So I do think that we can make this work by simply
> having the non-block nature of it being passed down the ->confirm()
> callback as that's the one that'll be waiting for IO. If we have that,
> then we can disregard the pipe locking as we won't be holding it over
> IO.
> 
> Which is what part of this series does, notably patch 1.
> 
> Only other oddity is pipe_to_sendpage(), which we can probably sanely
> ignore.
> 
> IOW, would you be fine with a v2 of this pull request where patch 2
> drops the conditional locking and just passes it to ->confirm()? That's
> certainly sane, and just makes the ultimate page locking conditional to
> avoid waiting on IO. I'd really hate to still be missing out on pipe
> performance with io_uring.

I guess that would still have blocking if you have someone doing splice
in a blocking fashion, and someone else trying to do RWF_NOWAIT reads or
writes to the pipe... The very thing the conditional pipe locking would
sort out.

Only fool proof alternative would seem to be having splice use a
specific pipe lock rather then pipe->mutex. And honestly pipes and
splice are so tied together than I'm not sure that doing separate
locking would be feasible.

Jens Axboe April 24, 2023, 10:05 p.m. UTC | #8

On 4/24/23 4:00?PM, Linus Torvalds wrote:
> On Mon, Apr 24, 2023 at 2:55?PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> I took another look at this, and the main issue is in fact splice
>> confirming buffers. So I do think that we can make this work by simply
>> having the non-block nature of it being passed down the ->confirm()
>> callback as that's the one that'll be waiting for IO. If we have that,
>> then we can disregard the pipe locking as we won't be holding it over
>> IO.
> 
> Ok, that part looks fine to me.
> 
> The pipe_buf_confirm() part of the series I don't find problematic,
> it's really conditional locking that I absolutely detest and has
> always been a sign of problems elsewhere.

Agree, the conditional locking is the ugly bit for sure. I'll reply to
your other email as my followup to my message discovered that this isn't
enough due to mixed splice/non-splice usage.

Jens Axboe April 24, 2023, 10:07 p.m. UTC | #9

On 4/24/23 3:58?PM, Linus Torvalds wrote:
> On Mon, Apr 24, 2023 at 2:37?PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> And I completely refuse to add that trylock hack to paper that over.
>> The pipe lock is *not* meant for IO.
> 
> If you want to paper it over, do it other ways.
> 
> I'd love to just magically fix splice, but hey, that might not be possible.

Don't think it is... At least not trivially.

> But possible fixes papering this over might be to make splice "poison
> a pipe, and make io_uring falls back on io workers only on pipes that
> do splice. Make any normal pipe read/write load sane.
> 
> And no, don't worry about races. If you have the same pipe used for
> io_uring IO *and* somebody else then doing splice on it and racing,
> just take the loss and tell people that they might hit a slow case if
> they do stupid things.
> 
> Basically, the patch might look like something like
> 
>  - do_pipe() sets FMODE_NOWAIT by default when creating a pipe
> 
>  - splice then clears FMODE_NOWAIT on pipes as they are used
> 
> and now io_uring sees whether the pipe is playing nice or not.
> 
> As far as I can tell, something like that would make the
> 'pipe_buf_confirm()' part unnecessary too, since that's only relevant
> for splice.
> 
> A fancier version might be to only do that "splice then clears
> FMODE_NOWAIT" thing if the other side of the splice has not set
> FMODE_NOWAIT.
> 
> Honestly, if the problem is "pipe IO is slow", then splice should not
> be the thing you optimize for.

I think that'd be an acceptable approach, and would at least fix the
pure pipe case which I suspect is 99.9% of them, if not more. And yes,
it'd mean that we don't need to do the ->confirm() change either, as the
pipe is already tainted at that point.

I'll respin a v2, post, and send in later this merge window.

Jens Axboe April 24, 2023, 10:44 p.m. UTC | #10

On 4/24/23 4:07?PM, Jens Axboe wrote:
> On 4/24/23 3:58?PM, Linus Torvalds wrote:
>> On Mon, Apr 24, 2023 at 2:37?PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> And I completely refuse to add that trylock hack to paper that over.
>>> The pipe lock is *not* meant for IO.
>>
>> If you want to paper it over, do it other ways.
>>
>> I'd love to just magically fix splice, but hey, that might not be possible.
> 
> Don't think it is... At least not trivially.
> 
>> But possible fixes papering this over might be to make splice "poison
>> a pipe, and make io_uring falls back on io workers only on pipes that
>> do splice. Make any normal pipe read/write load sane.
>>
>> And no, don't worry about races. If you have the same pipe used for
>> io_uring IO *and* somebody else then doing splice on it and racing,
>> just take the loss and tell people that they might hit a slow case if
>> they do stupid things.
>>
>> Basically, the patch might look like something like
>>
>>  - do_pipe() sets FMODE_NOWAIT by default when creating a pipe
>>
>>  - splice then clears FMODE_NOWAIT on pipes as they are used
>>
>> and now io_uring sees whether the pipe is playing nice or not.
>>
>> As far as I can tell, something like that would make the
>> 'pipe_buf_confirm()' part unnecessary too, since that's only relevant
>> for splice.
>>
>> A fancier version might be to only do that "splice then clears
>> FMODE_NOWAIT" thing if the other side of the splice has not set
>> FMODE_NOWAIT.
>>
>> Honestly, if the problem is "pipe IO is slow", then splice should not
>> be the thing you optimize for.
> 
> I think that'd be an acceptable approach, and would at least fix the
> pure pipe case which I suspect is 99.9% of them, if not more. And yes,
> it'd mean that we don't need to do the ->confirm() change either, as the
> pipe is already tainted at that point.
> 
> I'll respin a v2, post, and send in later this merge window.

Something like this. Not tested yet, but wanted to get your feedback
early to avoid issues down the line. Really just the first hunk, as
we're not really expecting f_mode to change and hence could just use
an xchg(). Seems saner to use a cmpxchg though, but maybe that's
too much of both belt and suspenders?

commit f10cbebf4dd7f77b0e87c77bb0191a5a07d5e7ac
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Apr 24 16:32:55 2023 -0600

    splice: clear FMODE_NOWAIT on file if splice/vmsplice is used
    
    In preparation for pipes setting FMODE_NOWAIT on pipes to indicate that
    RWF_NOWAIT/IOCB_NOWAIT is fully supported, have splice and vmsplice
    clear that file flag. Splice holds the pipe lock around IO and cannot
    easily be refactored to avoid that, as splice and pipes are inherently
    tied together.
    
    By clearing FMODE_NOWAIT if splice is being used on a pipe, other users
    of the pipe will know that the pipe is no longer safe for RWF_NOWAIT
    and friends.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/splice.c b/fs/splice.c
index 2c3dec2b6dfa..3ce7c07621e2 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -37,6 +37,29 @@
 
 #include "internal.h"
 
+/*
+ * Splice doesn't support FMODE_NOWAIT. Since pipes may set this flag to
+ * indicate they support non-blocking reads or writes, we must clear it
+ * here if set to avoid blocking other users of this pipe if splice is
+ * being done on it.
+ */
+static void pipe_clear_nowait(struct file *file)
+{
+	fmode_t new_fmode, old_fmode;
+
+	/*
+	 * We could get by with just doing an xchg() here as f_mode should
+	 * not change in the file's lifetime, but let's be safe and use
+	 * a cmpxchg() instead.
+	 */
+	do {
+		old_fmode = READ_ONCE(file->f_mode);
+		if (!(old_fmode & FMODE_NOWAIT))
+			break;
+		new_fmode = old_fmode & ~FMODE_NOWAIT;
+	} while (!try_cmpxchg(&file->f_mode, &old_fmode, new_fmode));
+}
+
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
  * a vm helper function, it's already simplified quite a bit by the
@@ -1211,10 +1234,16 @@ static long __do_splice(struct file *in, loff_t __user *off_in,
 	ipipe = get_pipe_info(in, true);
 	opipe = get_pipe_info(out, true);
 
-	if (ipipe && off_in)
-		return -ESPIPE;
-	if (opipe && off_out)
-		return -ESPIPE;
+	if (ipipe) {
+		if (off_in)
+			return -ESPIPE;
+		pipe_clear_nowait(in);
+	}
+	if (opipe) {
+		if (off_out)
+			return -ESPIPE;
+		pipe_clear_nowait(out);
+	}
 
 	if (off_out) {
 		if (copy_from_user(&offset, off_out, sizeof(loff_t)))
@@ -1311,6 +1340,8 @@ static long vmsplice_to_user(struct file *file, struct iov_iter *iter,
 	if (!pipe)
 		return -EBADF;
 
+	pipe_clear_nowait(file);
+
 	if (sd.total_len) {
 		pipe_lock(pipe);
 		ret = __splice_from_pipe(pipe, &sd, pipe_to_user);
@@ -1339,6 +1370,8 @@ static long vmsplice_to_pipe(struct file *file, struct iov_iter *iter,
 	if (!pipe)
 		return -EBADF;
 
+	pipe_clear_nowait(file);
+
 	pipe_lock(pipe);
 	ret = wait_for_space(pipe, flags);
 	if (!ret)

Linus Torvalds April 25, 2023, 3:16 a.m. UTC | #11

On Mon, Apr 24, 2023 at 3:45 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Something like this. Not tested yet, but wanted to get your feedback
> early to avoid issues down the line.

Ok, that try_cmpxchg() loop looks odd, but I guess it's the right thing to do.

That said, you should move the

        old_fmode = READ_ONCE(file->f_mode);

to outside the loop, because try_cmpxchg() will then update
'old_fmode' to the proper value and it should *not* be done again.

I also suspect that the READ_ONCE() itself is entirely superfluous,
because that is very much a value that we should let the compiler
optimize to *not* have to access more than it needs to.

So if the compiler had an earlier copy of that value, it should just
use it, rather than us forcing it to read it again.

But I suspect in this case it makes no real difference to code
generation. There's nothing else around it that uses f_mode, afaik,
and the try_cmpxchg() will reload it properly to fix any possible
races up.

The READ_ONCE() would possibly make more sense if we actually expected
that FMODE_NOWAIT bit to change more than once, but then we'd
presuably have some ordering rule and it should be a
smp_load_acquire() or whatever.

As it is, if we ever see it clear, we don't care any more, and the
real value consistency guarantee is in the try_cmpxchg itself. There
are no possible races ot mis-readings that could matter.

So I think it could/should just be something like

    void pipe_clear_nowait(struct file *file)
    {
        fmode_t fmode = file->f_mode;
        do {
            if (!(fmode & FMODE_NOWAIT))
                break;
        } while (!try_cmpxchg(&file->f_mode, &fmode, fmode & ~FMODE_NOWAIT));
    }

which sadly generates that big constant just because FMODE_NOWAIT is
up in the high bits and with the 'try_cmpxchg()', the compiler has no
choice but to get the full 32-bit value anyway.

               Linus

Jens Axboe April 25, 2023, 1:46 p.m. UTC | #12

On 4/24/23 9:16?PM, Linus Torvalds wrote:
> On Mon, Apr 24, 2023 at 3:45?PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Something like this. Not tested yet, but wanted to get your feedback
>> early to avoid issues down the line.
> 
> Ok, that try_cmpxchg() loop looks odd, but I guess it's the right thing to do.
> 
> That said, you should move the
> 
>         old_fmode = READ_ONCE(file->f_mode);
> 
> to outside the loop, because try_cmpxchg() will then update
> 'old_fmode' to the proper value and it should *not* be done again.
> 
> I also suspect that the READ_ONCE() itself is entirely superfluous,
> because that is very much a value that we should let the compiler
> optimize to *not* have to access more than it needs to.
> 
> So if the compiler had an earlier copy of that value, it should just
> use it, rather than us forcing it to read it again.
> 
> But I suspect in this case it makes no real difference to code
> generation. There's nothing else around it that uses f_mode, afaik,
> and the try_cmpxchg() will reload it properly to fix any possible
> races up.
> 
> The READ_ONCE() would possibly make more sense if we actually expected
> that FMODE_NOWAIT bit to change more than once, but then we'd
> presuably have some ordering rule and it should be a
> smp_load_acquire() or whatever.
> 
> As it is, if we ever see it clear, we don't care any more, and the
> real value consistency guarantee is in the try_cmpxchg itself. There
> are no possible races ot mis-readings that could matter.
> 
> So I think it could/should just be something like
> 
>     void pipe_clear_nowait(struct file *file)
>     {
>         fmode_t fmode = file->f_mode;
>         do {
>             if (!(fmode & FMODE_NOWAIT))
>                 break;
>         } while (!try_cmpxchg(&file->f_mode, &fmode, fmode & ~FMODE_NOWAIT));
>     }
> 
> which sadly generates that big constant just because FMODE_NOWAIT is
> up in the high bits and with the 'try_cmpxchg()', the compiler has no
> choice but to get the full 32-bit value anyway.

I'll go with this, it's definitely simpler. I suspected the important
bit was just doing the cmpxchg sanely and that the READ_ONCE() was
superflous given how it's used, and dropping the old_fmode is cleaner.

FWIW, I don't see any difference in code generation here on arm64 or
x86-64 if FMODE_NOWAIT is in the lower bits, as we could've just moved
it. We could make it the sign bit which might make the first compare
faster in general, but honestly don't think we really care that deeply
about that.

Updated the branch, it's:

https://git.kernel.dk/cgit/linux/log/?h=pipe-nonblock.2

It's just the cmpxchg patch now and the same "set FMODE_NOWAIT on pipes
unconditionally" from before.

Linus Torvalds April 25, 2023, 5:20 p.m. UTC | #13

On Mon, Apr 24, 2023 at 8:16 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> That said, you should move the
>
>         old_fmode = READ_ONCE(file->f_mode);
>
> to outside the loop, because try_cmpxchg() will then update
> 'old_fmode' to the proper value and it should *not* be done again.
>
> I also suspect that the READ_ONCE() itself is entirely superfluous,
> because that is very much a value that we should let the compiler
> optimize to *not* have to access more than it needs to.

I ended up looking around a bit, and the READ_ONCE() before the
"try_cmpxchg()" loop is definitely our common pattern.

However, I'm adding in the locking people, because I think that
pattern is wrong and historical.

I think it comes from the original cmpxchg loop model, where we did
the read inside the loop, and the READ_ONCE() was some kind of "let's
make sure it's updated every time" thing (which should be unnecessary
due to memory clobbers on the cmpxchg, but whatever...

Inside the loop, the READ_ONCE() also makes more sense in general, in
that there isn't any sane point in merging it with earlier values, so
there's no downside.

But the more I look at it, the more I'm convinced that our pattern of

        old = READ_ONCE(rq->fence.error);
        do {
                if (fatal_error(old))
                        return false;
        } while (!try_cmpxchg(&rq->fence.error, &old, error));

(to pick one random user) is simply horribly wrong.

If we have code where we had an earlier load of that same value (or an
earlier store, for that matter - anything where the compiler can
assume some starting value), then the READ_ONCE() only adds pointless
overhead.

And if we don't have it, then the READ_ONCE() doesn't add any value,
since it doesn't imply any ordering.

IOW, I think the READ_ONCE() pattern is either pointless or
detrimental. I see no upside.

So I think we should make the pattern be used with a try_cmpxchg()
loop to be either a proper *ordered* read (ie something like
"smp_load_acquire()" for reading the original value), or we should
just let the compiler read it any which way it wants. Not the
READ_ONCE() pattern we seem to have.

But I'm adding the locking maintainers to this email to make sure that
I'm not missing anything.

PeterZ in particular, who added that new (and hugely improved!)
try_cmpxchg() interface in the first place.

        Linus

Peter Zijlstra April 25, 2023, 7:49 p.m. UTC | #14

On Tue, Apr 25, 2023 at 10:20:31AM -0700, Linus Torvalds wrote:

> But the more I look at it, the more I'm convinced that our pattern of
> 
>         old = READ_ONCE(rq->fence.error);
>         do {
>                 if (fatal_error(old))
>                         return false;
>         } while (!try_cmpxchg(&rq->fence.error, &old, error));
> 
> (to pick one random user) is simply horribly wrong.

The last time this came up I shared your view however Mark argued for
the READ_ONCE() thusly:

  https://lore.kernel.org/all/Y71LoCIl+IFdy9D8@FVFF77S0Q05N/T/#u

Linus Torvalds April 25, 2023, 7:58 p.m. UTC | #15

On Tue, Apr 25, 2023 at 12:49 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> The last time this came up I shared your view however Mark argued for
> the READ_ONCE() thusly:
>
>   https://lore.kernel.org/all/Y71LoCIl+IFdy9D8@FVFF77S0Q05N/T/#u

Hmm.

Yes, I think Mark is right. It's not that 'old' might be wrong - that
doesn't matter because cmpxchg will work it out - it's just that 'new'
might not be consistent with the old value we then use.

Ok. I'll try to remember this, but maybe it might be worth documenting.

Jens - I don't think this actually matters for the f_mode value issue,
since the only thing that might change is that FMODE_NOWAIT bit, but I
was clearly wrong on READ_ONCE(). So that loop should have it, just to
have the right pattern after all.

My bad.

             Linus

Jens Axboe April 25, 2023, 8:10 p.m. UTC | #16

On 4/25/23 1:58?PM, Linus Torvalds wrote:
> Jens - I don't think this actually matters for the f_mode value issue,
> since the only thing that might change is that FMODE_NOWAIT bit, but I
> was clearly wrong on READ_ONCE(). So that loop should have it, just to
> have the right pattern after all.

Noted, I'll update it so it's consistent with the other use cases.
Thanks!

Linus Torvalds April 25, 2023, 8:29 p.m. UTC | #17

On Tue, Apr 25, 2023 at 12:58 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok. I'll try to remember this, but maybe it might be worth documenting.

We might document it by just making it clear that it's not that we
want to read it "once", it's that we want to have a stable value.

There might be other situations where that is all we want.

IOW, maybe we could have something like

  #define READ_STABLE(x) \
      ({ __auto_type __val = (x); OPTIMIZER_HIDE_VAR(__val); __val; })

instead - although from a quick look, the code generation is pretty
much exactly the same.

I dunno. Just throwing that idea out there as a "if reading _once_
isn't the issue, maybe we shouldn't make the code look like it is"...

                Linus

Peter Zijlstra May 8, 2023, 8:39 a.m. UTC | #18

On Sun, May 07, 2023 at 04:04:23PM +0200, Jonas Oberhauser wrote:
> 
> Am 4/25/2023 um 9:58 PM schrieb Linus Torvalds:
> > Yes, I think Mark is right. It's not that 'old' might be wrong - that
> > doesn't matter because cmpxchg will work it out - it's just that 'new'
> > might not be consistent with the old value we then use.
> 
> In the general pattern, besides the potential issue raised by Mark, tearing
> may also be an issue (longer example inspired by a case we met at the end of
> the mail) where 'old' being wrong matters.

There is yet another pattern where it actually matters:

	old = READ_ONCE(*ptr);
	do {
		if (cond(old))
			return false;

		new = func(old);
	} while (!try_cmpxchg(ptr, &old, new));

	return true;

In this case we rely on old being 'coherent'. The more obvious case is
where it returns old (also not uncommon), but even if it just checks a
(multi-bit) condition on old you don't want tearing.

David Laight May 8, 2023, 10:16 a.m. UTC | #19

From: Peter Zijlstra
> Sent: 08 May 2023 09:39
> 
> On Sun, May 07, 2023 at 04:04:23PM +0200, Jonas Oberhauser wrote:
> >
> > Am 4/25/2023 um 9:58 PM schrieb Linus Torvalds:
> > > Yes, I think Mark is right. It's not that 'old' might be wrong - that
> > > doesn't matter because cmpxchg will work it out - it's just that 'new'
> > > might not be consistent with the old value we then use.
> >
> > In the general pattern, besides the potential issue raised by Mark, tearing
> > may also be an issue (longer example inspired by a case we met at the end of
> > the mail) where 'old' being wrong matters.
> 
> There is yet another pattern where it actually matters:
> 
> 	old = READ_ONCE(*ptr);
> 	do {
> 		if (cond(old))
> 			return false;
> 
> 		new = func(old);
> 	} while (!try_cmpxchg(ptr, &old, new));
> 
> 	return true;
> 
> In this case we rely on old being 'coherent'. The more obvious case is
> where it returns old (also not uncommon), but even if it just checks a
> (multi-bit) condition on old you don't want tearing.

It isn't as though READ_ONCE() is expensive.

For kernel/device driver code, while CSE is useful, you pretty
much always want the compiler to always do loads into local
variables.
It is rather a shame there isn't a compiler option that
avoids these unusual any annoying operations.

Since the current 'rules' seem to require READ_ONCE() and
WRITE_ONCE() be used as pairs, why not make the data 'volatile'?
That ought to be the same as using volatile casts on all accesses.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

[GIT,PULL] pipe: nonblocking rw for io_uring

Pull-request

Message

Comments