Message ID: 20241205120332.1578562-1-mjguzik@gmail.com
State: New
Series: [RFC] fs: elide the smp_rmb fence in fd_install()
On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
>  void fd_install(unsigned int fd, struct file *file)
>  {
> -	struct files_struct *files = current->files;
> +	struct files_struct *files;
>  	struct fdtable *fdt;
>
>  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
>  		return;
>
> +	/*
> +	 * Synchronized with expand_fdtable(), see that routine for an
> +	 * explanation.
> +	 */
>  	rcu_read_lock_sched();
> +	files = READ_ONCE(current->files);

What are you trying to do with that READ_ONCE()?  current->files
itself is *not* changed by any of that code; current->files->fdtab is.
On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> >  void fd_install(unsigned int fd, struct file *file)
> >  {
> > -	struct files_struct *files = current->files;
> > +	struct files_struct *files;
> >  	struct fdtable *fdt;
> >
> >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> >  		return;
> >
> > +	/*
> > +	 * Synchronized with expand_fdtable(), see that routine for an
> > +	 * explanation.
> > +	 */
> >  	rcu_read_lock_sched();
> > +	files = READ_ONCE(current->files);
>
> What are you trying to do with that READ_ONCE()? current->files
> itself is *not* changed by any of that code; current->files->fdtab is.

To my understanding this is the idiomatic way of spelling out the
non-existent in Linux smp_consume_load, for the resize_in_progress flag.

Anyway to elaborate I'm gunning for a setup where the code is
semantically equivalent to having a lock around the work.

Pretend ->resize_lock exists, then:

fd_install:
	files = current->files;
	read_lock(files->resize_lock);
	fdt = rcu_dereference_sched(files->fdt);
	rcu_assign_pointer(fdt->fd[fd], file);
	read_unlock(files->resize_lock);

expand_fdtable:
	write_lock(files->resize_lock);
	[snip]
	rcu_assign_pointer(files->fdt, new_fdt);
	write_unlock(files->resize_lock);

Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
synchronize_rcu do it.
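[Editorial note: for readers following along, this is the pre-patch
pairing the RFC wants to elide — a minimal sketch abridged from
fs/file.c, with locking and error handling dropped for illustration:

	/* expand_files()/expand_fdtable(), writer side: */
	files->resize_in_progress = true;
	...
	rcu_assign_pointer(files->fdt, new_fdt);	/* publish the new table */
	smp_wmb();					/* order it before... */
	files->resize_in_progress = false;		/* ...the flag clear */

	/* fd_install(), lockless reader side: */
	rcu_read_lock_sched();
	if (unlikely(files->resize_in_progress)) {
		/* bail to the locked slow path under files->file_lock */
	}
	smp_rmb();				/* pairs with smp_wmb() above: */
	fdt = rcu_dereference_sched(files->fdt); /* clear flag implies new fdt */
	rcu_assign_pointer(fdt->fd[fd], file);
	rcu_read_unlock_sched();

The patch drops the reader-side smp_rmb() and argues that the
pre-existing synchronize_rcu() plus the writer-side smp_wmb() provide
the same guarantee.]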
On Thu 05-12-24 13:03:32, Mateusz Guzik wrote:
> See the added commentary for reasoning.
>
> ->resize_in_progress handling is moved inside of expand_fdtable() for
> clarity.
>
> Whacks an actual fence on arm64.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>

Hum, I don't think this works. What could happen now is:

CPU1						CPU2
expand_fdtable()				fd_install()
  files->resize_in_progress = true;
  ...
  if (atomic_read(&files->count) > 1)
    synchronize_rcu();
  ...
  rcu_assign_pointer(files->fdt, new_fdt);
  if (cur_fdt != &files->fdtab)
    call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
						rcu_read_lock_sched()
						fdt = rcu_dereference_sched(files->fdt);
						  /* Fetched old FD table - without
						   * smp_rmb() the read was reordered */
  rcu_assign_pointer(files->fdt, new_fdt);
  /*
   * Publish everything before we unset ->resize_in_progress,
   * see above for an explanation.
   */
  smp_wmb();
out:
  files->resize_in_progress = false;
						if (unlikely(files->resize_in_progress)) {
						  - false
						rcu_assign_pointer(fdt->fd[fd], file);
						  - store in the old table - boom.

								Honza

> diff --git a/fs/file.c b/fs/file.c
> index 019fb9acf91b..d065a24980da 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -233,28 +233,54 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
>  	__acquires(files->file_lock)
>  {
>  	struct fdtable *new_fdt, *cur_fdt;
> +	int err = 0;
>
> +	BUG_ON(files->resize_in_progress);
> +	files->resize_in_progress = true;
>  	spin_unlock(&files->file_lock);
>  	new_fdt = alloc_fdtable(nr + 1);
>
> -	/* make sure all fd_install() have seen resize_in_progress
> -	 * or have finished their rcu_read_lock_sched() section.
> +	/*
> +	 * Synchronize against the lockless fd_install().
> +	 *
> +	 * All work in that routine is enclosed with RCU sched section.
> +	 *
> +	 * We published ->resize_in_progress = true with the unlock above,
> +	 * which makes new arrivals bail to locked operation.
> +	 *
> +	 * Now we only need to wait for CPUs which did not observe the flag to
> +	 * leave and make sure their store to the fd table got published.
> +	 *
> +	 * We do it with synchronize_rcu(), which both waits for all sections to
> +	 * finish (taking care of the first part) and guarantees all CPUs issued a
> +	 * full fence (taking care of the second part).
> +	 *
> +	 * Note we know there is nobody to wait for if we are dealing with a
> +	 * single-threaded process.
>  	 */
>  	if (atomic_read(&files->count) > 1)
>  		synchronize_rcu();
>
>  	spin_lock(&files->file_lock);
> -	if (IS_ERR(new_fdt))
> -		return PTR_ERR(new_fdt);
> +	if (IS_ERR(new_fdt)) {
> +		err = PTR_ERR(new_fdt);
> +		goto out;
> +	}
>  	cur_fdt = files_fdtable(files);
>  	BUG_ON(nr < cur_fdt->max_fds);
>  	copy_fdtable(new_fdt, cur_fdt);
>  	rcu_assign_pointer(files->fdt, new_fdt);
>  	if (cur_fdt != &files->fdtab)
>  		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
> -	/* coupled with smp_rmb() in fd_install() */
> +
> +	/*
> +	 * Publish everything before we unset ->resize_in_progress, see above
> +	 * for an explanation.
> +	 */
>  	smp_wmb();
> -	return 0;
> +out:
> +	files->resize_in_progress = false;
> +	return err;
>  }
>
>  /*
> @@ -290,9 +316,7 @@ static int expand_files(struct files_struct *files, unsigned int nr)
>  		return -EMFILE;
>
>  	/* All good, so we try */
> -	files->resize_in_progress = true;
>  	error = expand_fdtable(files, nr);
> -	files->resize_in_progress = false;
>
>  	wake_up_all(&files->resize_wait);
>  	return error;
> @@ -629,13 +653,18 @@ EXPORT_SYMBOL(put_unused_fd);
>
>  void fd_install(unsigned int fd, struct file *file)
>  {
> -	struct files_struct *files = current->files;
> +	struct files_struct *files;
>  	struct fdtable *fdt;
>
>  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
>  		return;
>
> +	/*
> +	 * Synchronized with expand_fdtable(), see that routine for an
> +	 * explanation.
> +	 */
>  	rcu_read_lock_sched();
> +	files = READ_ONCE(current->files);
>
>  	if (unlikely(files->resize_in_progress)) {
>  		rcu_read_unlock_sched();
> @@ -646,8 +675,7 @@ void fd_install(unsigned int fd, struct file *file)
>  		spin_unlock(&files->file_lock);
>  		return;
>  	}
> -	/* coupled with smp_wmb() in expand_fdtable() */
> -	smp_rmb();
> +
>  	fdt = rcu_dereference_sched(files->fdt);
>  	BUG_ON(fdt->fd[fd] != NULL);
>  	rcu_assign_pointer(fdt->fd[fd], file);
> --
> 2.43.0
>
On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> See the added commentary for reasoning.
>
> ->resize_in_progress handling is moved inside of expand_fdtable() for
> clarity.
>
> Whacks an actual fence on arm64.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>
> To my reading of commentary above synchronize_rcu() this works fine(tm)
> and there is even other code relying on the same idea (percpu rwsems
> (see percpu_down_read for example), maybe there is more).
>
> However, given that barriers like to be tricky and I know about squat of
> RCU internals, I refer to Paul here.
>
> Paul, does this work? If not, any trivial tweaks to make it so?
>
> I mean smp_rmb looks dodgeable, at worst I made a mistake somewhere and
> the specific patch does not work.
>
>  fs/file.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 39 insertions(+), 11 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 019fb9acf91b..d065a24980da 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -233,28 +233,54 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
>  	__acquires(files->file_lock)
>  {
>  	struct fdtable *new_fdt, *cur_fdt;
> +	int err = 0;
>
> +	BUG_ON(files->resize_in_progress);

I think this BUG_ON() here is a bit unnecessary.

> +	files->resize_in_progress = true;

Minor: Why move that into expand_fdtable()? Having
files->resize_in_progress in here doesn't add much more clarity than
just having it set around expand_fdtable() in the caller.

Leaving it there also makes the patch smaller and clearer to follow as
you neither need the additional err nor the goto.

>  	spin_unlock(&files->file_lock);
>  	new_fdt = alloc_fdtable(nr + 1);
>
> -	/* make sure all fd_install() have seen resize_in_progress
> -	 * or have finished their rcu_read_lock_sched() section.
> +	/*
> +	 * Synchronize against the lockless fd_install().
> +	 *
> +	 * All work in that routine is enclosed with RCU sched section.
> +	 *
> +	 * We published ->resize_in_progress = true with the unlock above,
> +	 * which makes new arrivals bail to locked operation.
> +	 *
> +	 * Now we only need to wait for CPUs which did not observe the flag to
> +	 * leave and make sure their store to the fd table got published.
> +	 *
> +	 * We do it with synchronize_rcu(), which both waits for all sections to
> +	 * finish (taking care of the first part) and guarantees all CPUs issued a
> +	 * full fence (taking care of the second part).
> +	 *
> +	 * Note we know there is nobody to wait for if we are dealing with a
> +	 * single-threaded process.
>  	 */
>  	if (atomic_read(&files->count) > 1)
>  		synchronize_rcu();
>
>  	spin_lock(&files->file_lock);
> -	if (IS_ERR(new_fdt))
> -		return PTR_ERR(new_fdt);
> +	if (IS_ERR(new_fdt)) {
> +		err = PTR_ERR(new_fdt);
> +		goto out;
> +	}
>  	cur_fdt = files_fdtable(files);
>  	BUG_ON(nr < cur_fdt->max_fds);
>  	copy_fdtable(new_fdt, cur_fdt);
>  	rcu_assign_pointer(files->fdt, new_fdt);
>  	if (cur_fdt != &files->fdtab)
>  		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
> -	/* coupled with smp_rmb() in fd_install() */
> +
> +	/*
> +	 * Publish everything before we unset ->resize_in_progress, see above
> +	 * for an explanation.
> +	 */
>  	smp_wmb();
> -	return 0;
> +out:
> +	files->resize_in_progress = false;
> +	return err;
>  }
>
>  /*
> @@ -290,9 +316,7 @@ static int expand_files(struct files_struct *files, unsigned int nr)
>  		return -EMFILE;
>
>  	/* All good, so we try */
> -	files->resize_in_progress = true;
>  	error = expand_fdtable(files, nr);
> -	files->resize_in_progress = false;
>
>  	wake_up_all(&files->resize_wait);
>  	return error;
> @@ -629,13 +653,18 @@ EXPORT_SYMBOL(put_unused_fd);
>
>  void fd_install(unsigned int fd, struct file *file)
>  {
> -	struct files_struct *files = current->files;
> +	struct files_struct *files;
>  	struct fdtable *fdt;
>
>  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
>  		return;
>
> +	/*
> +	 * Synchronized with expand_fdtable(), see that routine for an
> +	 * explanation.
> +	 */
>  	rcu_read_lock_sched();
> +	files = READ_ONCE(current->files);
>
>  	if (unlikely(files->resize_in_progress)) {
>  		rcu_read_unlock_sched();
> @@ -646,8 +675,7 @@ void fd_install(unsigned int fd, struct file *file)
>  		spin_unlock(&files->file_lock);
>  		return;
>  	}
> -	/* coupled with smp_wmb() in expand_fdtable() */
> -	smp_rmb();
> +
>  	fdt = rcu_dereference_sched(files->fdt);
>  	BUG_ON(fdt->fd[fd] != NULL);
>  	rcu_assign_pointer(fdt->fd[fd], file);
> --
> 2.43.0
>
On Thu, Dec 5, 2024 at 3:46 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 05-12-24 13:03:32, Mateusz Guzik wrote:
> > See the added commentary for reasoning.
> >
> > ->resize_in_progress handling is moved inside of expand_fdtable() for
> > clarity.
> >
> > Whacks an actual fence on arm64.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
>
> Hum, I don't think this works. What could happen now is:
>
> CPU1						CPU2
> expand_fdtable()				fd_install()
>   files->resize_in_progress = true;
>   ...
>   if (atomic_read(&files->count) > 1)
>     synchronize_rcu();
>   ...
>   rcu_assign_pointer(files->fdt, new_fdt);
>   if (cur_fdt != &files->fdtab)
>     call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
>						rcu_read_lock_sched()
>						fdt = rcu_dereference_sched(files->fdt);
>						  /* Fetched old FD table - without
>						   * smp_rmb() the read was reordered */
>   rcu_assign_pointer(files->fdt, new_fdt);
>   /*
>    * Publish everything before we unset ->resize_in_progress,
>    * see above for an explanation.
>    */
>   smp_wmb();
> out:
>   files->resize_in_progress = false;
>						if (unlikely(files->resize_in_progress)) {
>						  - false
>						rcu_assign_pointer(fdt->fd[fd], file);
>						  - store in the old table - boom.

I don't believe this ordering is possible because of both
synchronize_rcu and the fence before updating resize_in_progress.

Any CPU which could try racing like that had to come in after the
synchronize_rcu() call, meaning one of the 3 possibilities:
- the flag is true and the fd table is old
- the flag is true and the fd table is new
- the flag is false and the fd table is new

Suppose the CPU reordered loads of the flag and the fd table. There is
no ordering in which it can see both the old table and the unset flag.

--
Mateusz Guzik <mjguzik gmail.com>
On Thu, Dec 5, 2024 at 3:58 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > +	BUG_ON(files->resize_in_progress);
>
> I think this BUG_ON() here is a bit unnecessary.

I'm an assert-friendly guy, but I'm not going to argue about this one.

> > +	files->resize_in_progress = true;
>
> Minor: Why move that into expand_fdtable()? Having
> files->resize_in_progress in here doesn't add much more clarity than
> just having it set around expand_fdtable() in the caller.
>
> Leaving it there also makes the patch smaller and clearer to follow as
> you neither need the additional err nor the goto.

That is indeed not the best looking goto in the world, but per the
commit message I moved the flag in so that it can be easier referred to
when describing synchronisation against fd_install.
On Thu 05-12-24 16:01:07, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 3:46 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 05-12-24 13:03:32, Mateusz Guzik wrote:
> > > See the added commentary for reasoning.
> > >
> > > ->resize_in_progress handling is moved inside of expand_fdtable() for
> > > clarity.
> > >
> > > Whacks an actual fence on arm64.
> > >
> > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> >
> > Hum, I don't think this works. What could happen now is:
> >
> > CPU1						CPU2
> > expand_fdtable()				fd_install()
> >   files->resize_in_progress = true;
> >   ...
> >   if (atomic_read(&files->count) > 1)
> >     synchronize_rcu();
> >   ...
> >   rcu_assign_pointer(files->fdt, new_fdt);
> >   if (cur_fdt != &files->fdtab)
> >     call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
> >						rcu_read_lock_sched()
> >						fdt = rcu_dereference_sched(files->fdt);
> >						  /* Fetched old FD table - without
> >						   * smp_rmb() the read was reordered */
> >   rcu_assign_pointer(files->fdt, new_fdt);
> >   /*
> >    * Publish everything before we unset ->resize_in_progress,
> >    * see above for an explanation.
> >    */
> >   smp_wmb();
> > out:
> >   files->resize_in_progress = false;
> >						if (unlikely(files->resize_in_progress)) {
> >						  - false
> >						rcu_assign_pointer(fdt->fd[fd], file);
> >						  - store in the old table - boom.
>
> I don't believe this ordering is possible because of both
> synchronize_rcu and the fence before updating resize_in_progress.
>
> Any CPU which could try racing like that had to come in after the
> synchronize_rcu() call, meaning one of the 3 possibilities:
> - the flag is true and the fd table is old
> - the flag is true and the fd table is new
> - the flag is false and the fd table is new

I agree here.

> Suppose the CPU reordered loads of the flag and the fd table. There is
> no ordering in which it can see both the old table and the unset flag.

But I disagree here. If the reads are reordered, then the fd table read can
happen during the "flag is true and the fd table is old" state and the flag
read can happen later in "flag is false and the fd table is new" state.
Just as I outlined above...

								Honza
On Thu, Dec 5, 2024 at 4:29 PM Jan Kara <jack@suse.cz> wrote:
> On Thu 05-12-24 16:01:07, Mateusz Guzik wrote:
> > Suppose the CPU reordered loads of the flag and the fd table. There is
> > no ordering in which it can see both the old table and the unset flag.
>
> But I disagree here. If the reads are reordered, then the fd table read can
> happen during the "flag is true and the fd table is old" state and the flag
> read can happen later in "flag is false and the fd table is new" state.
> Just as I outlined above...

In your example all the work happens *after* synchronize_rcu(). The
thread resizing the table already published the result even before
calling into it.

Furthermore by the time synchronize_rcu returns everyone is guaranteed
to have issued a full fence. Meaning nobody can see the flag as unset.
On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > >  void fd_install(unsigned int fd, struct file *file)
> > >  {
> > > -	struct files_struct *files = current->files;
> > > +	struct files_struct *files;
> > >  	struct fdtable *fdt;
> > >
> > >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > >  		return;
> > >
> > > +	/*
> > > +	 * Synchronized with expand_fdtable(), see that routine for an
> > > +	 * explanation.
> > > +	 */
> > >  	rcu_read_lock_sched();
> > > +	files = READ_ONCE(current->files);
> >
> > What are you trying to do with that READ_ONCE()?  current->files
> > itself is *not* changed by any of that code; current->files->fdtab is.
>
> To my understanding this is the idiomatic way of spelling out the
> non-existent in Linux smp_consume_load, for the resize_in_progress
> flag.

In Linus, "smp_consume_load()" is named rcu_dereference().

> Anyway to elaborate I'm gunning for a setup where the code is
> semantically equivalent to having a lock around the work.

Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
only with later RCU grace periods, such as those implemented by
synchronize_rcu().

> Pretend ->resize_lock exists, then:
> fd_install:
> 	files = current->files;
> 	read_lock(files->resize_lock);
> 	fdt = rcu_dereference_sched(files->fdt);
> 	rcu_assign_pointer(fdt->fd[fd], file);
> 	read_unlock(files->resize_lock);
>
> expand_fdtable:
> 	write_lock(files->resize_lock);
> 	[snip]
> 	rcu_assign_pointer(files->fdt, new_fdt);
> 	write_unlock(files->resize_lock);
>
> Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> synchronize_rcu do it.

OK, good, you did get the grace-period part of the puzzle.

However, please keep in mind that synchronize_rcu() has significant
latency by design.  There is a tradeoff between CPU consumption and
latency, and synchronize_rcu() therefore has latencies ranging upwards of
several milliseconds (not microseconds or nanoseconds).  I would be very
surprised if expand_fdtable() users would be happy with such a long delay.

Or are you using some trick to hide this delay?
On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > >  void fd_install(unsigned int fd, struct file *file)
> > > >  {
> > > > -	struct files_struct *files = current->files;
> > > > +	struct files_struct *files;
> > > >  	struct fdtable *fdt;
> > > >
> > > >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > >  		return;
> > > >
> > > > +	/*
> > > > +	 * Synchronized with expand_fdtable(), see that routine for an
> > > > +	 * explanation.
> > > > +	 */
> > > >  	rcu_read_lock_sched();
> > > > +	files = READ_ONCE(current->files);
> > >
> > > What are you trying to do with that READ_ONCE()?  current->files
> > > itself is *not* changed by any of that code; current->files->fdtab is.
> >
> > To my understanding this is the idiomatic way of spelling out the
> > non-existent in Linux smp_consume_load, for the resize_in_progress
> > flag.
>
> In Linus, "smp_consume_load()" is named rcu_dereference().

ok

> > Anyway to elaborate I'm gunning for a setup where the code is
> > semantically equivalent to having a lock around the work.
>
> Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> only with later RCU grace periods, such as those implemented by
> synchronize_rcu().

To my understanding the pre-case is already with the flag set upfront
and waiting for everyone to finish (which is already taking place in
stock code) + looking at it within the section.

> > Pretend ->resize_lock exists, then:
> > fd_install:
> > 	files = current->files;
> > 	read_lock(files->resize_lock);
> > 	fdt = rcu_dereference_sched(files->fdt);
> > 	rcu_assign_pointer(fdt->fd[fd], file);
> > 	read_unlock(files->resize_lock);
> >
> > expand_fdtable:
> > 	write_lock(files->resize_lock);
> > 	[snip]
> > 	rcu_assign_pointer(files->fdt, new_fdt);
> > 	write_unlock(files->resize_lock);
> >
> > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > synchronize_rcu do it.
>
> OK, good, you did get the grace-period part of the puzzle.
>
> However, please keep in mind that synchronize_rcu() has significant
> latency by design.  There is a tradeoff between CPU consumption and
> latency, and synchronize_rcu() therefore has latencies ranging upwards of
> several milliseconds (not microseconds or nanoseconds).  I would be very
> surprised if expand_fdtable() users would be happy with such a long delay.

The call is already there since 2015 and I only know of one case where
someone took an issue with it (and it could have been sorted out with
dup2 upfront to grow the table to the desired size). Amusingly I see
you patched it in 2018 from synchronize_sched to synchronize_rcu.
Bottom line though is that I'm not *adding* latency here. :)

So assuming the above can be ignored, do you confirm the patch works
(even if it needs some cosmetic changes)?

The entirety of the patch is about removing smp_rmb in fd_install with
small code rearrangement, while relying on the machinery which is
already there.
On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> > To my understanding this is the idiomatic way of spelling out the
> > non-existent in Linux smp_consume_load, for the resize_in_progress
> > flag.
>
> In Linus, "smp_consume_load()" is named rcu_dereference().

Linux.

But yes and no.

It's worth making it really really clear that "rcu_dereference()" is
*not* just a different name for some "smp_consume_load()" operation.

Why? Because a true smp_consume_load() would work with any random kind
of flags etc. And rcu_dereference() works only because it's a pointer,
and there's an inherent data dependency to what the result points to.

Paul obviously knows this, but let's make it very clear in this
discussion, because if somebody decided "I want a smp_consume_load(),
and I'll use rcu_dereference() to do that", the end result would
simply not work for arbitrary data, like a flags field or something,
where comparing it against a value will only result in a control
dependency, not an actual data dependency.

            Linus
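[Editorial note: a minimal sketch of the distinction Linus is drawing;
the names (gp, flag, x) are illustrative, not from the kernel:

	/*
	 * Data dependency: the second load takes its address from the
	 * first, so rcu_dereference() orders the two.
	 */
	struct foo *p = rcu_dereference(gp);
	int a = p->a;		/* cannot execute before p is known */

	/*
	 * Control dependency only: the compared value carries no
	 * address, so a later *load* may still be speculated early.
	 */
	if (READ_ONCE(flag))
		b = READ_ONCE(x);	/* may be hoisted before the flag
					 * load; a store here would be
					 * ordered, a load is not */

Per Documentation/memory-barriers.txt, a control dependency orders a
prior load only against later stores, never against later loads — which
is exactly why fd_install() has the smp_rmb() under discussion.]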
On Thu, Dec 5, 2024 at 8:26 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > > To my understanding this is the idiomatic way of spelling out the
> > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > flag.
> >
> > In Linus, "smp_consume_load()" is named rcu_dereference().
>
> Linux.
>
> But yes and no.
>
> It's worth making it really really clear that "rcu_dereference()" is
> *not* just a different name for some "smp_consume_load()" operation.
>
> Why? Because a true smp_consume_load() would work with any random kind
> of flags etc. And rcu_dereference() works only because it's a pointer,
> and there's an inherent data dependency to what the result points to.
>
> Paul obviously knows this, but let's make it very clear in this
> discussion, because if somebody decided "I want a smp_consume_load(),
> and I'll use rcu_dereference() to do that", the end result would
> simply not work for arbitrary data, like a flags field or something,
> where comparing it against a value will only result in a control
> dependency, not an actual data dependency.

So I checked for kicks and rcu_dereference comes with type checking,
as in passing something which is not a pointer even fails to compile.

I'll note though that a smp_load_consume_ptr or similarly named
routine would be nice and I'm rather confused why it was not added
given smp_load_acquire and smp_store_release being there.

One immediate user would be mnt_idmap(), like so:

diff --git a/include/linux/mount.h b/include/linux/mount.h
index 33f17b6e8732..4d3486ff67ed 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -76,7 +76,7 @@ struct vfsmount {
 static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
 {
 	/* Pairs with smp_store_release() in do_idmap_mount(). */
-	return READ_ONCE(mnt->mnt_idmap);
+	return smp_load_consume_ptr(mnt->mnt_idmap);
 }

 extern int mnt_want_write(struct vfsmount *mnt);
On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > >  void fd_install(unsigned int fd, struct file *file)
> > > > >  {
> > > > > -	struct files_struct *files = current->files;
> > > > > +	struct files_struct *files;
> > > > >  	struct fdtable *fdt;
> > > > >
> > > > >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > >  		return;
> > > > >
> > > > > +	/*
> > > > > +	 * Synchronized with expand_fdtable(), see that routine for an
> > > > > +	 * explanation.
> > > > > +	 */
> > > > >  	rcu_read_lock_sched();
> > > > > +	files = READ_ONCE(current->files);
> > > >
> > > > What are you trying to do with that READ_ONCE()?  current->files
> > > > itself is *not* changed by any of that code; current->files->fdtab is.
> > >
> > > To my understanding this is the idiomatic way of spelling out the
> > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > flag.
> >
> > In Linus, "smp_consume_load()" is named rcu_dereference().
>
> ok

And rcu_dereference(), and for that matter memory_order_consume, only
orders dereferences following the load of the pointer against
dereferences of that same pointer preceding the store of that pointer.

	T1				T2
	a: p->a = 1;			d: q = rcu_dereference(gp);
	b: r1 = p->b;			e: r2 = q->a;
	c: rcu_assign_pointer(gp, p);	f: q->b = 42;

Here, if (and only if!) T2's load into q gets the value stored by
T1, then T2's statements e and f are guaranteed to happen after T1's
statements a and b.  In your patch, I do not see this pattern for the
files->resize_in_progress flag.

> > > Anyway to elaborate I'm gunning for a setup where the code is
> > > semantically equivalent to having a lock around the work.
> >
> > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > only with later RCU grace periods, such as those implemented by
> > synchronize_rcu().
>
> To my understanding the pre-case is already with the flag set upfront
> and waiting for everyone to finish (which is already taking place in
> stock code) + looking at it within the section.

I freely confess that I do not understand the purpose of assigning to
files->resize_in_progress both before (pre-existing) and within (added)
expand_fdtable().  If the assignments before and after the call to
expand_fdtable() and the checks were under that lock, that could work,
but removing that lockless check might have performance and scalability
consequences.

> > > Pretend ->resize_lock exists, then:
> > > fd_install:
> > > 	files = current->files;
> > > 	read_lock(files->resize_lock);
> > > 	fdt = rcu_dereference_sched(files->fdt);
> > > 	rcu_assign_pointer(fdt->fd[fd], file);
> > > 	read_unlock(files->resize_lock);
> > >
> > > expand_fdtable:
> > > 	write_lock(files->resize_lock);
> > > 	[snip]
> > > 	rcu_assign_pointer(files->fdt, new_fdt);
> > > 	write_unlock(files->resize_lock);
> > >
> > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > synchronize_rcu do it.
> >
> > OK, good, you did get the grace-period part of the puzzle.
> >
> > However, please keep in mind that synchronize_rcu() has significant
> > latency by design.  There is a tradeoff between CPU consumption and
> > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > several milliseconds (not microseconds or nanoseconds).  I would be very
> > surprised if expand_fdtable() users would be happy with such a long delay.
>
> The call is already there since 2015 and I only know of one case where
> someone took an issue with it (and it could have been sorted out with
> dup2 upfront to grow the table to the desired size). Amusingly I see
> you patched it in 2018 from synchronize_sched to synchronize_rcu.
> Bottom line though is that I'm not *adding* latency here. :)

Are you saying that the smp_rmb() is unnecessary?  It doesn't seem like
you are saying that, because otherwise your patch could simply remove
it without additional code changes.  On the other hand, if it is a key
component of the synchronization, I don't see how that smp_rmb() can be
removed while still preserving that synchronization without adding another
synchronize_rcu() to that function to compensate.

Now, it might be that you are somehow cleverly reusing the pre-existing
synchronize_rcu(), but I am not immediately seeing how this would work.

And no, I do not recall making that particular change back in the
day, only that I did change all the calls to synchronize_sched() to
synchronize_rcu().  Please accept my apologies for my having failed
to meet your expectations.  And do not be too surprised if others have
similar expectations of you at some point in the future.  ;-)

> So assuming the above can be ignored, do you confirm the patch works
> (even if it needs some cosmetic changes)?
>
> The entirety of the patch is about removing smp_rmb in fd_install with
> small code rearrangement, while relying on the machinery which is
> already there.

The code to be synchronized is fairly small.  So why don't you
create a litmus test and ask herd7?  Please see tools/memory-model for
documentation and other example litmus tests.  This tool does the moral
equivalent of a full state-space search of the litmus tests, telling you
whether your "exists" condition is always, sometimes, or never satisfied.

							Thanx, Paul
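[Editorial note: the workflow Paul is pointing at lives in the kernel
tree. Assuming a litmus test saved as fd-install.litmus, a run looks
like:

	$ cd tools/memory-model
	$ herd7 -conf linux-kernel.cfg fd-install.litmus

The output lists the reachable final states and ends with an
"Observation ... Always/Sometimes/Never" line saying whether the
test's exists clause can be satisfied; see tools/memory-model/README
for installing herd7 and for worked examples.]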
On Thu, Dec 05, 2024 at 11:26:35AM -0800, Linus Torvalds wrote:
> On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > > To my understanding this is the idiomatic way of spelling out the
> > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > flag.
> >
> > In Linus, "smp_consume_load()" is named rcu_dereference().
>
> Linux.

One of those days...  ;-)

> But yes and no.
>
> It's worth making it really really clear that "rcu_dereference()" is
> *not* just a different name for some "smp_consume_load()" operation.
>
> Why? Because a true smp_consume_load() would work with any random kind
> of flags etc. And rcu_dereference() works only because it's a pointer,
> and there's an inherent data dependency to what the result points to.
>
> Paul obviously knows this, but let's make it very clear in this
> discussion, because if somebody decided "I want a smp_consume_load(),
> and I'll use rcu_dereference() to do that", the end result would
> simply not work for arbitrary data, like a flags field or something,
> where comparing it against a value will only result in a control
> dependency, not an actual data dependency.

Fair points!  And Linus (and Linux, for that matter) equally obviously
already knows this, but please note also that an smp_load_consume()
would still order only later dereferences of the thing returned from
smp_load_consume(), which means that it pretty much needs to be a
pointer.  (Yes, in theory, it could be an array index, but in practice
compilers know way too much about integer arithmetic for this to be
advisable.)

							Thanx, Paul
On Thu, Dec 05, 2024 at 08:47:24PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 8:26 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > > To my understanding this is the idiomatic way of spelling out the
> > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > flag.
> > >
> > > In Linus, "smp_consume_load()" is named rcu_dereference().
> >
> > Linux.
> >
> > But yes and no.
> >
> > It's worth making it really really clear that "rcu_dereference()" is
> > *not* just a different name for some "smp_consume_load()" operation.
> >
> > Why? Because a true smp_consume_load() would work with any random kind
> > of flags etc. And rcu_dereference() works only because it's a pointer,
> > and there's an inherent data dependency to what the result points to.
> >
> > Paul obviously knows this, but let's make it very clear in this
> > discussion, because if somebody decided "I want a smp_consume_load(),
> > and I'll use rcu_dereference() to do that", the end result would
> > simply not work for arbitrary data, like a flags field or something,
> > where comparing it against a value will only result in a control
> > dependency, not an actual data dependency.
>
> So I checked for kicks and rcu_dereference comes with type checking,
> as in passing something which is not a pointer even fails to compile.

That is by design, keeping in mind that consume loads order only
later dereferences against the pointer load.

> I'll note though that a smp_load_consume_ptr or similarly named
> routine would be nice and I'm rather confused why it was not added
> given smp_load_acquire and smp_store_release being there.

In recent kernels, READ_ONCE() is your smp_load_consume_ptr().  Or things
like rcu_dereference_check(p, 1).  But these can be used only when the
pointed-to object is guaranteed to live (as in not be freed) for the
full duration of the read-side use of that pointer.

> One immediate user would be mnt_idmap(), like so:
>
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 33f17b6e8732..4d3486ff67ed 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -76,7 +76,7 @@ struct vfsmount {
>  static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
>  {
>  	/* Pairs with smp_store_release() in do_idmap_mount(). */
> -	return READ_ONCE(mnt->mnt_idmap);
> +	return smp_load_consume_ptr(mnt->mnt_idmap);
>  }
>
>  extern int mnt_want_write(struct vfsmount *mnt);

These would have the same semantics.  And in v6.12, this is instead
smp_load_acquire().

							Thanx, Paul
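[Editorial note: the pairing Paul refers to, sketched with the names
from the quoted diff; this is an illustration of the idiom, not the
verbatim kernel code:

	/* do_idmap_mount(), writer side (per the comment in mount.h):
	 * release-store publishes the fully initialized idmap. */
	smp_store_release(&mnt->mnt_idmap, new_idmap);

	/* mnt_idmap(), reader side, as of v6.12: acquire-load orders
	 * all later accesses after the pointer load. */
	struct mnt_idmap *idmap = smp_load_acquire(&mnt->mnt_idmap);

The acquire form is strictly stronger than what the thread calls a
consume load: READ_ONCE() already orders dereferences *through* the
returned pointer (an address dependency), while smp_load_acquire()
additionally orders every later access, dependent or not.]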
On Thu, Dec 5, 2024 at 9:01 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> > On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > >
> > > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > > >  void fd_install(unsigned int fd, struct file *file)
> > > > > >  {
> > > > > > -	struct files_struct *files = current->files;
> > > > > > +	struct files_struct *files;
> > > > > >  	struct fdtable *fdt;
> > > > > >
> > > > > >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > > >  		return;
> > > > > >
> > > > > > +	/*
> > > > > > +	 * Synchronized with expand_fdtable(), see that routine for an
> > > > > > +	 * explanation.
> > > > > > +	 */
> > > > > >  	rcu_read_lock_sched();
> > > > > > +	files = READ_ONCE(current->files);
> > > > >
> > > > > What are you trying to do with that READ_ONCE()?  current->files
> > > > > itself is *not* changed by any of that code; current->files->fdtab is.
> > > >
> > > > To my understanding this is the idiomatic way of spelling out the
> > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > flag.
> > >
> > > In Linus, "smp_consume_load()" is named rcu_dereference().
> >
> > ok
>
> And rcu_dereference(), and for that matter memory_order_consume, only
> orders dereferences following the load of the pointer against
> dereferences of that same pointer preceding the store of that pointer.
>
> 	T1				T2
> 	a: p->a = 1;			d: q = rcu_dereference(gp);
> 	b: r1 = p->b;			e: r2 = q->a;
> 	c: rcu_assign_pointer(gp, p);	f: q->b = 42;
>
> Here, if (and only if!) T2's load into q gets the value stored by
> T1, then T2's statements e and f are guaranteed to happen after T1's
> statements a and b.  In your patch, I do not see this pattern for the
> files->resize_in_progress flag.
>
> > > > Anyway to elaborate I'm gunning for a setup where the code is
> > > > semantically equivalent to having a lock around the work.
> > >
> > > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > > only with later RCU grace periods, such as those implemented by
> > > synchronize_rcu().
> >
> > To my understanding the pre-case is already with the flag set upfront
> > and waiting for everyone to finish (which is already taking place in
> > stock code) + looking at it within the section.
>
> I freely confess that I do not understand the purpose of assigning to
> files->resize_in_progress both before (pre-existing) and within (added)
> expand_fdtable().  If the assignments before and after the call to
> expand_fdtable() and the checks were under that lock, that could work,
> but removing that lockless check might have performance and scalability
> consequences.
>
> > > > Pretend ->resize_lock exists, then:
> > > > fd_install:
> > > > 	files = current->files;
> > > > 	read_lock(files->resize_lock);
> > > > 	fdt = rcu_dereference_sched(files->fdt);
> > > > 	rcu_assign_pointer(fdt->fd[fd], file);
> > > > 	read_unlock(files->resize_lock);
> > > >
> > > > expand_fdtable:
> > > > 	write_lock(files->resize_lock);
> > > > 	[snip]
> > > > 	rcu_assign_pointer(files->fdt, new_fdt);
> > > > 	write_unlock(files->resize_lock);
> > > >
> > > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > > synchronize_rcu do it.
> > >
> > > OK, good, you did get the grace-period part of the puzzle.
> > >
> > > However, please keep in mind that synchronize_rcu() has significant
> > > latency by design.  There is a tradeoff between CPU consumption and
> > > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > > several milliseconds (not microseconds or nanoseconds).  I would be very
> > > surprised if expand_fdtable() users would be happy with such a long delay.
> >
> > The call is already there since 2015 and I only know of one case where
> > someone took an issue with it (and it could have been sorted out with
> > dup2 upfront to grow the table to the desired size). Amusingly I see
> > you patched it in 2018 from synchronize_sched to synchronize_rcu.
> > Bottom line though is that I'm not *adding* latency here. :)
>
> Are you saying that the smp_rmb() is unnecessary?  It doesn't seem like
> you are saying that, because otherwise your patch could simply remove
> it without additional code changes.  On the other hand, if it is a key
> component of the synchronization, I don't see how that smp_rmb() can be
> removed while still preserving that synchronization without adding another
> synchronize_rcu() to that function to compensate.
>
> Now, it might be that you are somehow cleverly reusing the pre-existing
> synchronize_rcu(), but I am not immediately seeing how this would work.
>
> And no, I do not recall making that particular change back in the
> day, only that I did change all the calls to synchronize_sched() to
> synchronize_rcu().  Please accept my apologies for my having failed
> to meet your expectations.  And do not be too surprised if others have
> similar expectations of you at some point in the future.  ;-)
>
> > So assuming the above can be ignored, do you confirm the patch works
> > (even if it needs some cosmetic changes)?
> >
> > The entirety of the patch is about removing smp_rmb in fd_install with
> > small code rearrangement, while relying on the machinery which is
> > already there.
>
> The code to be synchronized is fairly small.  So why don't you
> create a litmus test and ask herd7?  Please see tools/memory-model for
> documentation and other example litmus tests.  This tool does the moral
> equivalent of a full state-space search of the litmus tests, telling you
> whether your "exists" condition is always, sometimes, or never satisfied.

I think there is quite a degree of talking past each other in this thread.

I was not aware of herd7. Testing the thing with it sounds like a plan
to get out of it, so I'm going to do it and get back to you in a day
or two. Worst case the patch is a bust, best case the fence is already
of no use.
On Thu, Dec 05, 2024 at 09:15:14PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 9:01 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> > > On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > > >
> > > > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > > > >  void fd_install(unsigned int fd, struct file *file)
> > > > > > >  {
> > > > > > > -	struct files_struct *files = current->files;
> > > > > > > +	struct files_struct *files;
> > > > > > >  	struct fdtable *fdt;
> > > > > > >
> > > > > > >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > > > >  		return;
> > > > > > >
> > > > > > > +	/*
> > > > > > > +	 * Synchronized with expand_fdtable(), see that routine for an
> > > > > > > +	 * explanation.
> > > > > > > +	 */
> > > > > > >  	rcu_read_lock_sched();
> > > > > > > +	files = READ_ONCE(current->files);
> > > > > >
> > > > > > What are you trying to do with that READ_ONCE()?  current->files
> > > > > > itself is *not* changed by any of that code; current->files->fdtab is.
> > > > >
> > > > > To my understanding this is the idiomatic way of spelling out the
> > > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > > flag.
> > > >
> > > > In Linus, "smp_consume_load()" is named rcu_dereference().
> > >
> > > ok
> >
> > And rcu_dereference(), and for that matter memory_order_consume, only
> > orders dereferences following the load of the pointer against
> > dereferences of that same pointer preceding the store of that pointer.
> >
> > 	T1				T2
> > 	a: p->a = 1;			d: q = rcu_dereference(gp);
> > 	b: r1 = p->b;			e: r2 = q->a;
> > 	c: rcu_assign_pointer(gp, p);	f: q->b = 42;
> >
> > Here, if (and only if!) T2's load into q gets the value stored by
> > T1, then T2's statements e and f are guaranteed to happen after T1's
> > statements a and b.  In your patch, I do not see this pattern for the
> > files->resize_in_progress flag.
> >
> > > > > Anyway to elaborate I'm gunning for a setup where the code is
> > > > > semantically equivalent to having a lock around the work.
> > > >
> > > > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > > > only with later RCU grace periods, such as those implemented by
> > > > synchronize_rcu().
> > >
> > > To my understanding the pre-case is already with the flag set upfront
> > > and waiting for everyone to finish (which is already taking place in
> > > stock code) + looking at it within the section.
> >
> > I freely confess that I do not understand the purpose of assigning to
> > files->resize_in_progress both before (pre-existing) and within (added)
> > expand_fdtable().  If the assignments before and after the call to
> > expand_fdtable() and the checks were under that lock, that could work,
> > but removing that lockless check might have performance and scalability
> > consequences.
> >
> > > > > Pretend ->resize_lock exists, then:
> > > > > fd_install:
> > > > > 	files = current->files;
> > > > > 	read_lock(files->resize_lock);
> > > > > 	fdt = rcu_dereference_sched(files->fdt);
> > > > > 	rcu_assign_pointer(fdt->fd[fd], file);
> > > > > 	read_unlock(files->resize_lock);
> > > > >
> > > > > expand_fdtable:
> > > > > 	write_lock(files->resize_lock);
> > > > > 	[snip]
> > > > > 	rcu_assign_pointer(files->fdt, new_fdt);
> > > > > 	write_unlock(files->resize_lock);
> > > > >
> > > > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > > > synchronize_rcu do it.
> > > >
> > > > OK, good, you did get the grace-period part of the puzzle.
> > > >
> > > > However, please keep in mind that synchronize_rcu() has significant
> > > > latency by design.  There is a tradeoff between CPU consumption and
> > > > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > > > several milliseconds (not microseconds or nanoseconds).  I would be very
> > > > surprised if expand_fdtable() users would be happy with such a long delay.
> > >
> > > The call is already there since 2015 and I only know of one case where
> > > someone took an issue with it (and it could have been sorted out with
> > > dup2 upfront to grow the table to the desired size). Amusingly I see
> > > you patched it in 2018 from synchronize_sched to synchronize_rcu.
> > > Bottom line though is that I'm not *adding* latency here. :)
> >
> > Are you saying that the smp_rmb() is unnecessary?  It doesn't seem like
> > you are saying that, because otherwise your patch could simply remove
> > it without additional code changes.  On the other hand, if it is a key
> > component of the synchronization, I don't see how that smp_rmb() can be
> > removed while still preserving that synchronization without adding another
> > synchronize_rcu() to that function to compensate.
> >
> > Now, it might be that you are somehow cleverly reusing the pre-existing
> > synchronize_rcu(), but I am not immediately seeing how this would work.
> >
> > And no, I do not recall making that particular change back in the
> > day, only that I did change all the calls to synchronize_sched() to
> > synchronize_rcu().  Please accept my apologies for my having failed
> > to meet your expectations.  And do not be too surprised if others have
> > similar expectations of you at some point in the future.  ;-)
> >
> > > So assuming the above can be ignored, do you confirm the patch works
> > > (even if it needs some cosmetic changes)?
> > >
> > > The entirety of the patch is about removing smp_rmb in fd_install with
> > > small code rearrangement, while relying on the machinery which is
> > > already there.
> >
> > The code to be synchronized is fairly small.  So why don't you
> > create a litmus test and ask herd7?  Please see tools/memory-model for
> > documentation and other example litmus tests.  This tool does the moral
> > equivalent of a full state-space search of the litmus tests, telling you
> > whether your "exists" condition is always, sometimes, or never satisfied.
>
> I think there is quite a degree of talking past each other in this thread.
>
> I was not aware of herd7. Testing the thing with it sounds like a plan
> to get out of it, so I'm going to do it and get back to you in a day
> or two. Worst case the patch is a bust, best case the fence is already
> of no use.

Very good!  My grouchiness earlier in this thread notwithstanding, I am
happy to review your litmus tests.  (You will likely need more than one.)

And the inevitable unsolicited advice:  Make those litmus tests as small
as you possibly can.  Full-state-space search is extremely powerful, but
it does not scale very well.

							Thanx, Paul
On Thu, Dec 05, 2024 at 12:11:57PM -0800, Paul E. McKenney wrote:
> On Thu, Dec 05, 2024 at 08:47:24PM +0100, Mateusz Guzik wrote:
> > On Thu, Dec 5, 2024 at 8:26 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > > To my understanding this is the idiomatic way of spelling out the
> > > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > > flag.
> > > >
> > > > In Linus, "smp_consume_load()" is named rcu_dereference().
> > >
> > > Linux.
> > >
> > > But yes and no.
> > >
> > > It's worth making it really really clear that "rcu_dereference()" is
> > > *not* just a different name for some "smp_consume_load()" operation.
> > >
> > > Why? Because a true smp_consume_load() would work with any random kind
> > > of flags etc. And rcu_dereference() works only because it's a pointer,
> > > and there's an inherent data dependency to what the result points to.
> > >
> > > Paul obviously knows this, but let's make it very clear in this
> > > discussion, because if somebody decided "I want a smp_consume_load(),
> > > and I'll use rcu_dereference() to do that", the end result would
> > > simply not work for arbitrary data, like a flags field or something,
> > > where comparing it against a value will only result in a control
> > > dependency, not an actual data dependency.
> >
> > So I checked for kicks and rcu_dereference comes with type checking,
> > as in passing something which is not a pointer even fails to compile.
>
> That is by design, keeping in mind that consume loads order only
> later dereferences against the pointer load.
>
> > I'll note though that a smp_load_consume_ptr or similarly named
> > routine would be nice and I'm rather confused why it was not added
> > given smp_load_acquire and smp_store_release being there.
>
> In recent kernels, READ_ONCE() is your smp_load_consume_ptr().  Or things
> like rcu_dereference_check(p, 1).  But these can be used only when the
> pointed-to object is guaranteed to live (as in not be freed) for the
> full duration of the read-side use of that pointer.

Both true in the case of mnt_idmap(). All mounts start with mnt_idmap
set to:

	extern struct mnt_idmap nop_mnt_idmap

which doesn't go away ever. And we only allow to change the idmapping of
a mount once. So the READ_ONCE() will always retrieve an object that is
guaranteed to stay alive for at least as long as the mount stays alive
(in the nop_mnt_idmap case obviously "forever").

I wanted to avoid a) pushing complicated RCU dances all through
filesystems and the VFS and b) taking any reference counts whatsoever
on mnt_idmap (other than sharing the same mnt_idmap between different
mounts at creation time).

(Though I do have long-standing ideas how that would work without
changing the mnt_idmap pointer.)
On Thu 05-12-24 16:36:40, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 4:29 PM Jan Kara <jack@suse.cz> wrote:
> > On Thu 05-12-24 16:01:07, Mateusz Guzik wrote:
> > > Suppose the CPU reordered loads of the flag and the fd table. There is
> > > no ordering in which it can see both the old table and the unset flag.
> >
> > But I disagree here. If the reads are reordered, then the fd table read can
> > happen during the "flag is true and the fd table is old" state and the flag
> > read can happen later in "flag is false and the fd table is new" state.
> > Just as I outlined above...

Ugh, I might be missing something obvious so please bear with me.

> In your example all the work happens *after* synchronize_rcu().

Correct.

> The thread resizing the table already published the result even before
> calling into it.

Really? You proposed expand_table() does:

	BUG_ON(files->resize_in_progress);
	files->resize_in_progress = true;
	spin_unlock(&files->file_lock);
	new_fdt = alloc_fdtable(nr + 1);
	if (atomic_read(&files->count) > 1)
		synchronize_rcu();
	spin_lock(&files->file_lock);
	if (IS_ERR(new_fdt)) {
		err = PTR_ERR(new_fdt);
		goto out;
	}
	cur_fdt = files_fdtable(files);
	BUG_ON(nr < cur_fdt->max_fds);
	copy_fdtable(new_fdt, cur_fdt);
	rcu_assign_pointer(files->fdt, new_fdt);
	if (cur_fdt != &files->fdtab)
		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
	smp_wmb();
out:
	files->resize_in_progress = false;
	return err;

So synchronize_rcu() is called very early AFAICT. At that point we have
allocated the new table but copy & store in files->fdt happens *after*
synchronize_rcu() has finished. So the copy & store can be racing with
fd_install() calling rcu_read_lock_sched() and prefetching files->fdt
(but not files->resize_in_progress!) into a local CPU cache.

> Furthermore by the time synchronize_rcu returns
> everyone is guaranteed to have issued a full fence. Meaning nobody can
> see the flag as unset.

Well, nobody can see the flag unset only until expand_fdtable() reaches:

	smp_wmb();
out:
	files->resize_in_progress = false;

And smp_wmb() doesn't give you much unless the reads of
files->resize_in_progress and files->fdt are ordered somehow on the
other side (i.e., in fd_install()).

But I'm looking forward to the Litmus test to resolve our discussion :)

								Honza
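[Editorial note: a minimal litmus test for the disputed window might
look like the sketch below. It is a hand-reduced model, not the kernel
code: "old" and "new" stand in for the two fd tables, "flag" for
->resize_in_progress, and rcu_read_lock_sched() is modeled with plain
rcu_read_lock(), which LKMM treats the same way for grace-period
purposes. Whether the exists clause is reachable is precisely what the
thread is arguing about, so the sketch only frames the question for
herd7 rather than answering it:

	C fdt-resize-vs-install

	{
		fdt=old;
		old=0;
		new=0;
	}

	P0(int **fdt, int *new, int *flag)
	{
		WRITE_ONCE(*flag, 1);		/* make new arrivals take the lock */
		synchronize_rcu();		/* wait out pre-existing readers */
		rcu_assign_pointer(*fdt, new);	/* publish the new table */
		smp_wmb();			/* order the publish... */
		WRITE_ONCE(*flag, 0);		/* ...before clearing the flag */
	}

	P1(int **fdt, int *flag)
	{
		int r0;
		int *r1;

		rcu_read_lock();
		r0 = READ_ONCE(*flag);		/* lockless fast-path check */
		r1 = rcu_dereference(*fdt);	/* no smp_rmb() between the loads */
		rcu_read_unlock();
	}

	exists (1:r0=0 /\ 1:r1=old)

P1 mirrors the patched lockless fd_install() path: if herd7 reports the
exists clause as "Sometimes", P1 can observe the flag clear yet still
use the old table, which is Jan's scenario; "Never" would support the
patch's reasoning.]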
diff --git a/fs/file.c b/fs/file.c
index 019fb9acf91b..d065a24980da 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -233,28 +233,54 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
 	__acquires(files->file_lock)
 {
 	struct fdtable *new_fdt, *cur_fdt;
+	int err = 0;
 
+	BUG_ON(files->resize_in_progress);
+	files->resize_in_progress = true;
 	spin_unlock(&files->file_lock);
 	new_fdt = alloc_fdtable(nr + 1);
 
-	/* make sure all fd_install() have seen resize_in_progress
-	 * or have finished their rcu_read_lock_sched() section.
+	/*
+	 * Synchronize against the lockless fd_install().
+	 *
+	 * All work in that routine is enclosed with RCU sched section.
+	 *
+	 * We published ->resize_in_progress = true with the unlock above,
+	 * which makes new arrivals bail to locked operation.
+	 *
+	 * Now we only need to wait for CPUs which did not observe the flag to
+	 * leave and make sure their store to the fd table got published.
+	 *
+	 * We do it with synchronize_rcu(), which both waits for all sections to
+	 * finish (taking care of the first part) and guarantees all CPUs issued a
+	 * full fence (taking care of the second part).
+	 *
+	 * Note we know there is nobody to wait for if we are dealing with a
+	 * single-threaded process.
 	 */
 	if (atomic_read(&files->count) > 1)
 		synchronize_rcu();
 
 	spin_lock(&files->file_lock);
-	if (IS_ERR(new_fdt))
-		return PTR_ERR(new_fdt);
+	if (IS_ERR(new_fdt)) {
+		err = PTR_ERR(new_fdt);
+		goto out;
+	}
 	cur_fdt = files_fdtable(files);
 	BUG_ON(nr < cur_fdt->max_fds);
 	copy_fdtable(new_fdt, cur_fdt);
 	rcu_assign_pointer(files->fdt, new_fdt);
 	if (cur_fdt != &files->fdtab)
 		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
-	/* coupled with smp_rmb() in fd_install() */
+
+	/*
+	 * Publish everything before we unset ->resize_in_progress, see above
+	 * for an explanation.
+	 */
 	smp_wmb();
-	return 0;
+out:
+	files->resize_in_progress = false;
+	return err;
 }
 
 /*
@@ -290,9 +316,7 @@ static int expand_files(struct files_struct *files, unsigned int nr)
 		return -EMFILE;
 
 	/* All good, so we try */
-	files->resize_in_progress = true;
 	error = expand_fdtable(files, nr);
-	files->resize_in_progress = false;
 
 	wake_up_all(&files->resize_wait);
 	return error;
@@ -629,13 +653,18 @@ EXPORT_SYMBOL(put_unused_fd);
 
 void fd_install(unsigned int fd, struct file *file)
 {
-	struct files_struct *files = current->files;
+	struct files_struct *files;
 	struct fdtable *fdt;
 
 	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
 		return;
 
+	/*
+	 * Synchronized with expand_fdtable(), see that routine for an
+	 * explanation.
+	 */
 	rcu_read_lock_sched();
+	files = READ_ONCE(current->files);
 
 	if (unlikely(files->resize_in_progress)) {
 		rcu_read_unlock_sched();
@@ -646,8 +675,7 @@ void fd_install(unsigned int fd, struct file *file)
 		spin_unlock(&files->file_lock);
 		return;
 	}
-	/* coupled with smp_wmb() in expand_fdtable() */
-	smp_rmb();
+
 	fdt = rcu_dereference_sched(files->fdt);
 	BUG_ON(fdt->fd[fd] != NULL);
 	rcu_assign_pointer(fdt->fd[fd], file);
See the added commentary for reasoning.

->resize_in_progress handling is moved inside of expand_fdtable() for
clarity.

Whacks an actual fence on arm64.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

To my reading of commentary above synchronize_rcu() this works fine(tm)
and there is even other code relying on the same idea (percpu rwsems
(see percpu_down_read for example), maybe there is more).

However, given that barriers like to be tricky and I know about squat of
RCU internals, I refer to Paul here.

Paul, does this work? If not, any trivial tweaks to make it so?

I mean smp_rmb looks dodgeable, at worst I made a mistake somewhere and
the specific patch does not work.

 fs/file.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 39 insertions(+), 11 deletions(-)