Message ID | 20230421-seilbahn-vorpreschen-bd73ac3c88d7@brauner (mailing list archive)
State | Mainlined, archived
Series | [GIT,PULL] pipe: nonblocking rw for io_uring
On Fri, Apr 21, 2023 at 7:02 AM Christian Brauner <brauner@kernel.org> wrote:
>
> This contains Jens' work to support FMODE_NOWAIT and thus IOCB_NOWAIT
> for pipes ensuring that all places can deal with non-blocking requests.
>
> To this end, pass down the information that this is a nonblocking
> request so that pipe locking, allocation, and buffer checking correctly
> deal with those.

Ok, I pulled this, but then I unpulled it again.

Doing conditional locking for O_NONBLOCK and friends is not ok. Yes, it's been done, and I may even have let some slip through, but it's just WRONG.

There is absolutely no situation where a "ok, so the lock on this data structure was taken, we'll go to some async model" is worth it.

Every single time I've seen this, it's been some developer who thinks that O_NONBLOCK is somehow some absolute "this cannot schedule" thing. And every single time it's been broken and horrid crap that just made everything more complicated and slowed things down.

If some lock wait is a real problem, then the lock needs to be just fixed. Not "ok, let's make a non-blocking version and fall back if it's held".

Note that FMODE_NOWAIT does not mean (and *CANNOT* mean) that you'd somehow be able to do the IO in some atomic context anyway. Many of our kernel locks don't even support that (eg mutexes).

So thinking that FMODE_NOWAIT is that kind of absolute is the wrong kind of thinking entirely.

FMODE_NOWAIT should mean that no *IO* gets done. And yes, that might mean that allocations fail too. But not this kind of "let's turn locking into 'trylock' stuff".

The whole flag is misnamed. It should have been FMODE_NOIO, the same way we have IOCB_NOIO.

If you want FMODE_ATOMIC, then that is something entirely and completely different, and is probably crazy.

We have done it in one area (RCU pathname lookup), and it was worth it there, and it was a *huge* undertaking. It was worth it, but it was worth it because it was a serious thing with some serious design and a critical area.

Not this kind of "conditional trylock" garbage which just means that people will require 'poll()' to now add the lock to the waitqueue, or make all callers go into some "let's use a different thread instead" logic.

Linus
On 4/24/23 3:05 PM, Linus Torvalds wrote:
> On Fri, Apr 21, 2023 at 7:02 AM Christian Brauner <brauner@kernel.org> wrote:
>>
>> This contains Jens' work to support FMODE_NOWAIT and thus IOCB_NOWAIT
>> for pipes ensuring that all places can deal with non-blocking requests.
>>
>> To this end, pass down the information that this is a nonblocking
>> request so that pipe locking, allocation, and buffer checking correctly
>> deal with those.
>
> Ok, I pulled this, but then I unpulled it again.
>
> Doing conditional locking for O_NONBLOCK and friends is not ok. Yes,
> it's been done, and I may even have let some slip through, but it's
> just WRONG.
>
> There is absolutely no situation where a "ok, so the lock on this data
> structure was taken, we'll go to some async model" is worth it.
>
> Every single time I've seen this, it's been some developer who thinks
> that O_NONBLOCK is somehow some absolute "this cannot schedule" thing.
> And every single time it's been broken and horrid crap that just made
> everything more complicated and slowed things down.
>
> If some lock wait is a real problem, then the lock needs to be just
> fixed. Not "ok, let's make a non-blocking version and fall back if
> it's held".
>
> Note that FMODE_NOWAIT does not mean (and *CANNOT* mean) that you'd
> somehow be able to do the IO in some atomic context anyway. Many of
> our kernel locks don't even support that (eg mutexes).
>
> So thinking that FMODE_NOWAIT is that kind of absolute is the wrong
> kind of thinking entirely.
>
> FMODE_NOWAIT should mean that no *IO* gets done. And yes, that might
> mean that allocations fail too. But not this kind of "let's turn
> locking into 'trylock' stuff".
>
> The whole flag is misnamed. It should have been FMODE_NOIO, the same
> way we have IOCB_NOIO.
>
> If you want FMODE_ATOMIC, then that is something entirely and
> completely different, and is probably crazy.
>
> We have done it in one area (RCU pathname lookup), and it was worth it
> there, and it was a *huge* undertaking. It was worth it, but it was
> worth it because it was a serious thing with some serious design and a
> critical area.
>
> Not this kind of "conditional trylock" garbage which just means that
> people will require 'poll()' to now add the lock to the waitqueue, or
> make all callers go into some "let's use a different thread instead"
> logic.

I'm fully agreeing with you on the semantics for FMODE_NOWAIT/IOCB_NOWAIT, and also agree that they are misnamed and really should be _NOIO. There is no conceptual misunderstanding there, nor do I care about atomic semantics.

Obviously the above is for io_uring to be able to improve the performance on pipes, because right now everything is punted to io-wq and that severely hampers performance with how pipes are generally used. io_uring doesn't care about atomic context, but it very much cares about NOT sleeping for IO during issue as that has adverse performance implications.

If we don't ever wait for IO with the pipe lock held, then we can skip the conditional locking. But with splice, that's not at all the case! We most certainly wait for IO there with the pipe lock held. And yes I totally agree that conditional locking is not pretty, but for some cases it really is a necessary evil.

People have complained about io_uring pipe performance, and I ran some testing myself. Being able to sanely read/write from/to pipes without punting to io-wq is an easy 10x improvement for that use case. How can I enable FMODE_NOWAIT on pipes if we have splice holding the pipe lock over IO? It's not feasible.

So please reconsider, doing IO to/from pipes efficiently is actually kind of useful for io_uring...
On Mon, Apr 24, 2023 at 2:22 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> If we don't ever wait for IO with the pipe lock held, then we can skip
> the conditional locking. But with splice, that's not at all the case! We
> most certainly wait for IO there with the pipe lock held.

I think that then needs to just be fixed.

I really think that trylock due to "nonblocking" IO is fundamentally wrong. Thinking that you need it is just a sign of something else being very wrong.

That "very wrong" thing might well be splice then not honoring non-blocking IO on a non-blocking pipe.

And I completely refuse to add that trylock hack to paper that over. The pipe lock is *not* meant for IO.

Linus
On 4/24/23 3:37 PM, Linus Torvalds wrote:
> On Mon, Apr 24, 2023 at 2:22 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> If we don't ever wait for IO with the pipe lock held, then we can skip
>> the conditional locking. But with splice, that's not at all the case! We
>> most certainly wait for IO there with the pipe lock held.
>
> I think that then needs to just be fixed.

I took another look at this, and the main issue is in fact splice confirming buffers. So I do think that we can make this work by simply having the non-block nature of it being passed down the ->confirm() callback as that's the one that'll be waiting for IO. If we have that, then we can disregard the pipe locking as we won't be holding it over IO.

Which is what part of this series does, notably patch 1.

Only other oddity is pipe_to_sendpage(), which we can probably sanely ignore.

IOW, would you be fine with a v2 of this pull request where patch 2 drops the conditional locking and just passes it to ->confirm()? That's certainly sane, and just makes the ultimate page locking conditional to avoid waiting on IO. I'd really hate to still be missing out on pipe performance with io_uring.
On Mon, Apr 24, 2023 at 2:37 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> And I completely refuse to add that trylock hack to paper that over.
> The pipe lock is *not* meant for IO.

If you want to paper it over, do it other ways.

I'd love to just magically fix splice, but hey, that might not be possible.

But possible fixes papering this over might be to make splice "poison" a pipe, and make io_uring fall back on io workers only on pipes that do splice. Make any normal pipe read/write load sane.

And no, don't worry about races. If you have the same pipe used for io_uring IO *and* somebody else then doing splice on it and racing, just take the loss and tell people that they might hit a slow case if they do stupid things.

Basically, the patch might look like something like

 - do_pipe() sets FMODE_NOWAIT by default when creating a pipe

 - splice then clears FMODE_NOWAIT on pipes as they are used

and now io_uring sees whether the pipe is playing nice or not.

As far as I can tell, something like that would make the 'pipe_buf_confirm()' part unnecessary too, since that's only relevant for splice.

A fancier version might be to only do that "splice then clears FMODE_NOWAIT" thing if the other side of the splice has not set FMODE_NOWAIT.

Honestly, if the problem is "pipe IO is slow", then splice should not be the thing you optimize for.

Linus
On Mon, Apr 24, 2023 at 2:55 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> I took another look at this, and the main issue is in fact splice
> confirming buffers. So I do think that we can make this work by simply
> having the non-block nature of it being passed down the ->confirm()
> callback as that's the one that'll be waiting for IO. If we have that,
> then we can disregard the pipe locking as we won't be holding it over
> IO.

Ok, that part looks fine to me.

The pipe_buf_confirm() part of the series I don't find problematic; it's really the conditional locking that I absolutely detest and that has always been a sign of problems elsewhere.

Linus
On 4/24/23 3:55 PM, Jens Axboe wrote:
> On 4/24/23 3:37 PM, Linus Torvalds wrote:
>> On Mon, Apr 24, 2023 at 2:22 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> If we don't ever wait for IO with the pipe lock held, then we can skip
>>> the conditional locking. But with splice, that's not at all the case! We
>>> most certainly wait for IO there with the pipe lock held.
>>
>> I think that then needs to just be fixed.
>
> I took another look at this, and the main issue is in fact splice
> confirming buffers. So I do think that we can make this work by simply
> having the non-block nature of it being passed down the ->confirm()
> callback as that's the one that'll be waiting for IO. If we have that,
> then we can disregard the pipe locking as we won't be holding it over
> IO.
>
> Which is what part of this series does, notably patch 1.
>
> Only other oddity is pipe_to_sendpage(), which we can probably sanely
> ignore.
>
> IOW, would you be fine with a v2 of this pull request where patch 2
> drops the conditional locking and just passes it to ->confirm()? That's
> certainly sane, and just makes the ultimate page locking conditional to
> avoid waiting on IO. I'd really hate to still be missing out on pipe
> performance with io_uring.

I guess that would still have blocking if you have someone doing splice in a blocking fashion, and someone else trying to do RWF_NOWAIT reads or writes to the pipe... The very thing the conditional pipe locking would sort out.

The only foolproof alternative would seem to be having splice use a specific pipe lock rather than pipe->mutex. And honestly, pipes and splice are so tied together that I'm not sure doing separate locking would be feasible.
On 4/24/23 4:00 PM, Linus Torvalds wrote:
> On Mon, Apr 24, 2023 at 2:55 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> I took another look at this, and the main issue is in fact splice
>> confirming buffers. So I do think that we can make this work by simply
>> having the non-block nature of it being passed down the ->confirm()
>> callback as that's the one that'll be waiting for IO. If we have that,
>> then we can disregard the pipe locking as we won't be holding it over
>> IO.
>
> Ok, that part looks fine to me.
>
> The pipe_buf_confirm() part of the series I don't find problematic,
> it's really conditional locking that I absolutely detest and has
> always been a sign of problems elsewhere.

Agree, the conditional locking is the ugly bit for sure. I'll reply to your other email, as the followup to my own message discovered that this isn't enough due to mixed splice/non-splice usage.
On 4/24/23 3:58 PM, Linus Torvalds wrote:
> On Mon, Apr 24, 2023 at 2:37 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> And I completely refuse to add that trylock hack to paper that over.
>> The pipe lock is *not* meant for IO.
>
> If you want to paper it over, do it other ways.
>
> I'd love to just magically fix splice, but hey, that might not be possible.

Don't think it is... At least not trivially.

> But possible fixes papering this over might be to make splice "poison"
> a pipe, and make io_uring fall back on io workers only on pipes that
> do splice. Make any normal pipe read/write load sane.
>
> And no, don't worry about races. If you have the same pipe used for
> io_uring IO *and* somebody else then doing splice on it and racing,
> just take the loss and tell people that they might hit a slow case if
> they do stupid things.
>
> Basically, the patch might look like something like
>
>  - do_pipe() sets FMODE_NOWAIT by default when creating a pipe
>
>  - splice then clears FMODE_NOWAIT on pipes as they are used
>
> and now io_uring sees whether the pipe is playing nice or not.
>
> As far as I can tell, something like that would make the
> 'pipe_buf_confirm()' part unnecessary too, since that's only relevant
> for splice.
>
> A fancier version might be to only do that "splice then clears
> FMODE_NOWAIT" thing if the other side of the splice has not set
> FMODE_NOWAIT.
>
> Honestly, if the problem is "pipe IO is slow", then splice should not
> be the thing you optimize for.

I think that'd be an acceptable approach, and would at least fix the pure pipe case which I suspect is 99.9% of them, if not more. And yes, it'd mean that we don't need to do the ->confirm() change either, as the pipe is already tainted at that point.

I'll respin a v2, post, and send in later this merge window.
On 4/24/23 4:07 PM, Jens Axboe wrote:
> On 4/24/23 3:58 PM, Linus Torvalds wrote:
>> On Mon, Apr 24, 2023 at 2:37 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> And I completely refuse to add that trylock hack to paper that over.
>>> The pipe lock is *not* meant for IO.
>>
>> If you want to paper it over, do it other ways.
>>
>> I'd love to just magically fix splice, but hey, that might not be possible.
>
> Don't think it is... At least not trivially.
>
>> But possible fixes papering this over might be to make splice "poison"
>> a pipe, and make io_uring fall back on io workers only on pipes that
>> do splice. Make any normal pipe read/write load sane.
>>
>> And no, don't worry about races. If you have the same pipe used for
>> io_uring IO *and* somebody else then doing splice on it and racing,
>> just take the loss and tell people that they might hit a slow case if
>> they do stupid things.
>>
>> Basically, the patch might look like something like
>>
>>  - do_pipe() sets FMODE_NOWAIT by default when creating a pipe
>>
>>  - splice then clears FMODE_NOWAIT on pipes as they are used
>>
>> and now io_uring sees whether the pipe is playing nice or not.
>>
>> As far as I can tell, something like that would make the
>> 'pipe_buf_confirm()' part unnecessary too, since that's only relevant
>> for splice.
>>
>> A fancier version might be to only do that "splice then clears
>> FMODE_NOWAIT" thing if the other side of the splice has not set
>> FMODE_NOWAIT.
>>
>> Honestly, if the problem is "pipe IO is slow", then splice should not
>> be the thing you optimize for.
>
> I think that'd be an acceptable approach, and would at least fix the
> pure pipe case which I suspect is 99.9% of them, if not more. And yes,
> it'd mean that we don't need to do the ->confirm() change either, as the
> pipe is already tainted at that point.
>
> I'll respin a v2, post, and send in later this merge window.

Something like this. Not tested yet, but wanted to get your feedback early to avoid issues down the line. Really just the first hunk, as we're not really expecting f_mode to change and hence could just use an xchg(). Seems saner to use a cmpxchg() though, but maybe that's too much belt and suspenders?

commit f10cbebf4dd7f77b0e87c77bb0191a5a07d5e7ac
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Apr 24 16:32:55 2023 -0600

    splice: clear FMODE_NOWAIT on file if splice/vmsplice is used

    In preparation for pipes setting FMODE_NOWAIT to indicate that
    RWF_NOWAIT/IOCB_NOWAIT is fully supported, have splice and vmsplice
    clear that file flag. Splice holds the pipe lock around IO and cannot
    easily be refactored to avoid that, as splice and pipes are inherently
    tied together.

    By clearing FMODE_NOWAIT if splice is being used on a pipe, other
    users of the pipe will know that the pipe is no longer safe for
    RWF_NOWAIT and friends.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/splice.c b/fs/splice.c
index 2c3dec2b6dfa..3ce7c07621e2 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -37,6 +37,29 @@
 
 #include "internal.h"
 
+/*
+ * Splice doesn't support FMODE_NOWAIT. Since pipes may set this flag to
+ * indicate they support non-blocking reads or writes, we must clear it
+ * here if set to avoid blocking other users of this pipe if splice is
+ * being done on it.
+ */
+static void pipe_clear_nowait(struct file *file)
+{
+	fmode_t new_fmode, old_fmode;
+
+	/*
+	 * We could get by with just doing an xchg() here as f_mode should
+	 * not change in the file's lifetime, but let's be safe and use
+	 * a cmpxchg() instead.
+	 */
+	do {
+		old_fmode = READ_ONCE(file->f_mode);
+		if (!(old_fmode & FMODE_NOWAIT))
+			break;
+		new_fmode = old_fmode & ~FMODE_NOWAIT;
+	} while (!try_cmpxchg(&file->f_mode, &old_fmode, new_fmode));
+}
+
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
  * a vm helper function, it's already simplified quite a bit by the
@@ -1211,10 +1234,16 @@ static long __do_splice(struct file *in, loff_t __user *off_in,
 	ipipe = get_pipe_info(in, true);
 	opipe = get_pipe_info(out, true);
 
-	if (ipipe && off_in)
-		return -ESPIPE;
-	if (opipe && off_out)
-		return -ESPIPE;
+	if (ipipe) {
+		if (off_in)
+			return -ESPIPE;
+		pipe_clear_nowait(in);
+	}
+	if (opipe) {
+		if (off_out)
+			return -ESPIPE;
+		pipe_clear_nowait(out);
+	}
 
 	if (off_out) {
 		if (copy_from_user(&offset, off_out, sizeof(loff_t)))
@@ -1311,6 +1340,8 @@ static long vmsplice_to_user(struct file *file, struct iov_iter *iter,
 	if (!pipe)
 		return -EBADF;
 
+	pipe_clear_nowait(file);
+
 	if (sd.total_len) {
 		pipe_lock(pipe);
 		ret = __splice_from_pipe(pipe, &sd, pipe_to_user);
@@ -1339,6 +1370,8 @@ static long vmsplice_to_pipe(struct file *file, struct iov_iter *iter,
 	if (!pipe)
 		return -EBADF;
 
+	pipe_clear_nowait(file);
+
 	pipe_lock(pipe);
 	ret = wait_for_space(pipe, flags);
 	if (!ret)
On Mon, Apr 24, 2023 at 3:45 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Something like this. Not tested yet, but wanted to get your feedback
> early to avoid issues down the line.

Ok, that try_cmpxchg() loop looks odd, but I guess it's the right thing to do.

That said, you should move the

	old_fmode = READ_ONCE(file->f_mode);

to outside the loop, because try_cmpxchg() will then update 'old_fmode' to the proper value and it should *not* be done again.

I also suspect that the READ_ONCE() itself is entirely superfluous, because that is very much a value that we should let the compiler optimize to *not* have to access more than it needs to.

So if the compiler had an earlier copy of that value, it should just use it, rather than us forcing it to read it again.

But I suspect in this case it makes no real difference to code generation. There's nothing else around it that uses f_mode, afaik, and the try_cmpxchg() will reload it properly to fix any possible races up.

The READ_ONCE() would possibly make more sense if we actually expected that FMODE_NOWAIT bit to change more than once, but then we'd presumably have some ordering rule and it should be a smp_load_acquire() or whatever.

As it is, if we ever see it clear, we don't care any more, and the real value consistency guarantee is in the try_cmpxchg itself. There are no possible races or mis-readings that could matter.

So I think it could/should just be something like

	void pipe_clear_nowait(struct file *file)
	{
		fmode_t fmode = file->f_mode;

		do {
			if (!(fmode & FMODE_NOWAIT))
				break;
		} while (!try_cmpxchg(&file->f_mode, &fmode, fmode & ~FMODE_NOWAIT));
	}

which sadly generates that big constant just because FMODE_NOWAIT is up in the high bits and with the 'try_cmpxchg()', the compiler has no choice but to get the full 32-bit value anyway.

Linus
On 4/24/23 9:16 PM, Linus Torvalds wrote:
> On Mon, Apr 24, 2023 at 3:45 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Something like this. Not tested yet, but wanted to get your feedback
>> early to avoid issues down the line.
>
> Ok, that try_cmpxchg() loop looks odd, but I guess it's the right thing to do.
>
> That said, you should move the
>
>	old_fmode = READ_ONCE(file->f_mode);
>
> to outside the loop, because try_cmpxchg() will then update
> 'old_fmode' to the proper value and it should *not* be done again.
>
> I also suspect that the READ_ONCE() itself is entirely superfluous,
> because that is very much a value that we should let the compiler
> optimize to *not* have to access more than it needs to.
>
> So if the compiler had an earlier copy of that value, it should just
> use it, rather than us forcing it to read it again.
>
> But I suspect in this case it makes no real difference to code
> generation. There's nothing else around it that uses f_mode, afaik,
> and the try_cmpxchg() will reload it properly to fix any possible
> races up.
>
> The READ_ONCE() would possibly make more sense if we actually expected
> that FMODE_NOWAIT bit to change more than once, but then we'd
> presumably have some ordering rule and it should be a
> smp_load_acquire() or whatever.
>
> As it is, if we ever see it clear, we don't care any more, and the
> real value consistency guarantee is in the try_cmpxchg itself. There
> are no possible races or mis-readings that could matter.
>
> So I think it could/should just be something like
>
>	void pipe_clear_nowait(struct file *file)
>	{
>		fmode_t fmode = file->f_mode;
>
>		do {
>			if (!(fmode & FMODE_NOWAIT))
>				break;
>		} while (!try_cmpxchg(&file->f_mode, &fmode, fmode & ~FMODE_NOWAIT));
>	}
>
> which sadly generates that big constant just because FMODE_NOWAIT is
> up in the high bits and with the 'try_cmpxchg()', the compiler has no
> choice but to get the full 32-bit value anyway.

I'll go with this, it's definitely simpler. I suspected the important bit was just doing the cmpxchg sanely and that the READ_ONCE() was superfluous given how it's used, and dropping the old_fmode is cleaner.

FWIW, I don't see any difference in code generation here on arm64 or x86-64 if FMODE_NOWAIT is in the lower bits, as we could've just moved it. We could make it the sign bit which might make the first compare faster in general, but honestly I don't think we really care that deeply about that.

Updated the branch, it's:

https://git.kernel.dk/cgit/linux/log/?h=pipe-nonblock.2

It's just the cmpxchg patch now and the same "set FMODE_NOWAIT on pipes unconditionally" from before.
On Mon, Apr 24, 2023 at 8:16 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> That said, you should move the
>
>	old_fmode = READ_ONCE(file->f_mode);
>
> to outside the loop, because try_cmpxchg() will then update
> 'old_fmode' to the proper value and it should *not* be done again.
>
> I also suspect that the READ_ONCE() itself is entirely superfluous,
> because that is very much a value that we should let the compiler
> optimize to *not* have to access more than it needs to.

I ended up looking around a bit, and the READ_ONCE() before the "try_cmpxchg()" loop is definitely our common pattern. However, I'm adding in the locking people, because I think that pattern is wrong and historical.

I think it comes from the original cmpxchg loop model, where we did the read inside the loop, and the READ_ONCE() was some kind of "let's make sure it's updated every time" thing (which should be unnecessary due to memory clobbers on the cmpxchg, but whatever...).

Inside the loop, the READ_ONCE() also makes more sense in general, in that there isn't any sane point in merging it with earlier values, so there's no downside.

But the more I look at it, the more I'm convinced that our pattern of

	old = READ_ONCE(rq->fence.error);
	do {
		if (fatal_error(old))
			return false;
	} while (!try_cmpxchg(&rq->fence.error, &old, error));

(to pick one random user) is simply horribly wrong.

If we have code where we had an earlier load of that same value (or an earlier store, for that matter - anything where the compiler can assume some starting value), then the READ_ONCE() only adds pointless overhead.

And if we don't have it, then the READ_ONCE() doesn't add any value, since it doesn't imply any ordering.

IOW, I think the READ_ONCE() pattern is either pointless or detrimental. I see no upside.

So I think we should make the pattern used with a try_cmpxchg() loop be either a proper *ordered* read (ie something like "smp_load_acquire()" for reading the original value), or we should just let the compiler read it any which way it wants. Not the READ_ONCE() pattern we seem to have.

But I'm adding the locking maintainers to this email to make sure that I'm not missing anything. PeterZ in particular, who added that new (and hugely improved!) try_cmpxchg() interface in the first place.

Linus
On Tue, Apr 25, 2023 at 10:20:31AM -0700, Linus Torvalds wrote:
> But the more I look at it, the more I'm convinced that our pattern of
>
>	old = READ_ONCE(rq->fence.error);
>	do {
>		if (fatal_error(old))
>			return false;
>	} while (!try_cmpxchg(&rq->fence.error, &old, error));
>
> (to pick one random user) is simply horribly wrong.

The last time this came up I shared your view, however Mark argued for the READ_ONCE() thusly:

https://lore.kernel.org/all/Y71LoCIl+IFdy9D8@FVFF77S0Q05N/T/#u
On Tue, Apr 25, 2023 at 12:49 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> The last time this came up I shared your view however Mark argued for
> the READ_ONCE() thusly:
>
> https://lore.kernel.org/all/Y71LoCIl+IFdy9D8@FVFF77S0Q05N/T/#u

Hmm. Yes, I think Mark is right. It's not that 'old' might be wrong - that doesn't matter because cmpxchg will work it out - it's just that 'new' might not be consistent with the old value we then use.

Ok. I'll try to remember this, but maybe it might be worth documenting.

Jens - I don't think this actually matters for the f_mode value issue, since the only thing that might change is that FMODE_NOWAIT bit, but I was clearly wrong on READ_ONCE(). So that loop should have it, just to have the right pattern after all. My bad.

Linus
On 4/25/23 1:58 PM, Linus Torvalds wrote:
> Jens - I don't think this actually matters for the f_mode value issue,
> since the only thing that might change is that FMODE_NOWAIT bit, but I
> was clearly wrong on READ_ONCE(). So that loop should have it, just to
> have the right pattern after all.

Noted, I'll update it so it's consistent with the other use cases. Thanks!
On Tue, Apr 25, 2023 at 12:58 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Ok. I'll try to remember this, but maybe it might be worth documenting.

We might document it by just making it clear that it's not that we want to read it "once", it's that we want to have a stable value. There might be other situations where that is all we want.

IOW, maybe we could have something like

	#define READ_STABLE(x) \
		({ __auto_type __val = (x); OPTIMIZER_HIDE_VAR(__val); __val; })

instead - although from a quick look, the code generation is pretty much exactly the same.

I dunno. Just throwing that idea out there as a "if reading _once_ isn't the issue, maybe we shouldn't make the code look like it is"...

Linus
On Sun, May 07, 2023 at 04:04:23PM +0200, Jonas Oberhauser wrote:
>
> On 4/25/2023 at 9:58 PM, Linus Torvalds wrote:
>> Yes, I think Mark is right. It's not that 'old' might be wrong - that
>> doesn't matter because cmpxchg will work it out - it's just that 'new'
>> might not be consistent with the old value we then use.
>
> In the general pattern, besides the potential issue raised by Mark, tearing
> may also be an issue (longer example inspired by a case we met at the end of
> the mail) where 'old' being wrong matters.

There is yet another pattern where it actually matters:

	old = READ_ONCE(*ptr);
	do {
		if (cond(old))
			return false;

		new = func(old);
	} while (!try_cmpxchg(ptr, &old, new));

	return true;

In this case we rely on old being 'coherent'. The more obvious case is where it returns old (also not uncommon), but even if it just checks a (multi-bit) condition on old you don't want tearing.
From: Peter Zijlstra
> Sent: 08 May 2023 09:39
>
> On Sun, May 07, 2023 at 04:04:23PM +0200, Jonas Oberhauser wrote:
>>
>> On 4/25/2023 at 9:58 PM, Linus Torvalds wrote:
>>> Yes, I think Mark is right. It's not that 'old' might be wrong - that
>>> doesn't matter because cmpxchg will work it out - it's just that 'new'
>>> might not be consistent with the old value we then use.
>>
>> In the general pattern, besides the potential issue raised by Mark, tearing
>> may also be an issue (longer example inspired by a case we met at the end of
>> the mail) where 'old' being wrong matters.
>
> There is yet another pattern where it actually matters:
>
>	old = READ_ONCE(*ptr);
>	do {
>		if (cond(old))
>			return false;
>
>		new = func(old);
>	} while (!try_cmpxchg(ptr, &old, new));
>
>	return true;
>
> In this case we rely on old being 'coherent'. The more obvious case is
> where it returns old (also not uncommon), but even if it just checks a
> (multi-bit) condition on old you don't want tearing.

It isn't as though READ_ONCE() is expensive. For kernel/device driver code, while CSE is useful, you pretty much always want the compiler to do loads into local variables. It is rather a shame there isn't a compiler option that avoids these unusual and annoying operations.

Since the current 'rules' seem to require READ_ONCE() and WRITE_ONCE() be used as pairs, why not make the data 'volatile'? That ought to be the same as using volatile casts on all accesses.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)