Message ID: 20241205120332.1578562-1-mjguzik@gmail.com
State: New
Series: [RFC] fs: elide the smp_rmb fence in fd_install()
On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
>  void fd_install(unsigned int fd, struct file *file)
>  {
> -	struct files_struct *files = current->files;
> +	struct files_struct *files;
>  	struct fdtable *fdt;
>
>  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
>  		return;
>
> +	/*
> +	 * Synchronized with expand_fdtable(), see that routine for an
> +	 * explanation.
> +	 */
>  	rcu_read_lock_sched();
> +	files = READ_ONCE(current->files);

What are you trying to do with that READ_ONCE()?  current->files
itself is *not* changed by any of that code; current->files->fdtab is.
On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> >  void fd_install(unsigned int fd, struct file *file)
> >  {
> > -	struct files_struct *files = current->files;
> > +	struct files_struct *files;
> >  	struct fdtable *fdt;
> >
> >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> >  		return;
> >
> > +	/*
> > +	 * Synchronized with expand_fdtable(), see that routine for an
> > +	 * explanation.
> > +	 */
> >  	rcu_read_lock_sched();
> > +	files = READ_ONCE(current->files);
>
> What are you trying to do with that READ_ONCE()? current->files
> itself is *not* changed by any of that code; current->files->fdtab is.

To my understanding this is the idiomatic way of spelling out the
non-existent in Linux smp_consume_load, for the resize_in_progress flag.

Anyway to elaborate I'm gunning for a setup where the code is
semantically equivalent to having a lock around the work.

Pretend ->resize_lock exists, then:

fd_install:
	files = current->files;
	read_lock(files->resize_lock);
	fdt = rcu_dereference_sched(files->fdt);
	rcu_assign_pointer(fdt->fd[fd], file);
	read_unlock(files->resize_lock);

expand_fdtable:
	write_lock(files->resize_lock);
	[snip]
	rcu_assign_pointer(files->fdt, new_fdt);
	write_unlock(files->resize_lock);

Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
synchronize_rcu do it.
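[Editorial note: for readers following along, this is the pre-patch
pairing the RFC wants to elide — a minimal sketch abridged from
fs/file.c, with locking and error handling dropped for illustration:

	/* expand_files()/expand_fdtable(), writer side: */
	files->resize_in_progress = true;
	...
	rcu_assign_pointer(files->fdt, new_fdt);	/* publish the new table */
	smp_wmb();					/* order it before... */
	files->resize_in_progress = false;		/* ...the flag clear */

	/* fd_install(), lockless reader side: */
	rcu_read_lock_sched();
	if (unlikely(files->resize_in_progress)) {
		/* bail to the locked slow path under files->file_lock */
	}
	smp_rmb();				/* pairs with smp_wmb() above: */
	fdt = rcu_dereference_sched(files->fdt); /* clear flag implies new fdt */
	rcu_assign_pointer(fdt->fd[fd], file);
	rcu_read_unlock_sched();

The patch drops the reader-side smp_rmb() and argues that the
pre-existing synchronize_rcu() plus the writer-side smp_wmb() provide
the same guarantee.]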
On Thu 05-12-24 13:03:32, Mateusz Guzik wrote:
> See the added commentary for reasoning.
>
> ->resize_in_progress handling is moved inside of expand_fdtable() for
> clarity.
>
> Whacks an actual fence on arm64.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>

Hum, I don't think this works. What could happen now is:

CPU1						CPU2
expand_fdtable()				fd_install()
  files->resize_in_progress = true;
  ...
  if (atomic_read(&files->count) > 1)
    synchronize_rcu();
  ...
  rcu_assign_pointer(files->fdt, new_fdt);
  if (cur_fdt != &files->fdtab)
    call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
						rcu_read_lock_sched()
						fdt = rcu_dereference_sched(files->fdt);
						  /* Fetched old FD table - without
						   * smp_rmb() the read was reordered */
  rcu_assign_pointer(files->fdt, new_fdt);
  /*
   * Publish everything before we unset ->resize_in_progress,
   * see above for an explanation.
   */
  smp_wmb();
out:
  files->resize_in_progress = false;
						if (unlikely(files->resize_in_progress)) {
						  - false
						rcu_assign_pointer(fdt->fd[fd], file);
						  - store in the old table - boom.

								Honza

> diff --git a/fs/file.c b/fs/file.c
> index 019fb9acf91b..d065a24980da 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -233,28 +233,54 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
>  	__acquires(files->file_lock)
>  {
>  	struct fdtable *new_fdt, *cur_fdt;
> +	int err = 0;
>
> +	BUG_ON(files->resize_in_progress);
> +	files->resize_in_progress = true;
>  	spin_unlock(&files->file_lock);
>  	new_fdt = alloc_fdtable(nr + 1);
>
> -	/* make sure all fd_install() have seen resize_in_progress
> -	 * or have finished their rcu_read_lock_sched() section.
> +	/*
> +	 * Synchronize against the lockless fd_install().
> +	 *
> +	 * All work in that routine is enclosed with RCU sched section.
> +	 *
> +	 * We published ->resize_in_progress = true with the unlock above,
> +	 * which makes new arrivals bail to locked operation.
> +	 *
> +	 * Now we only need to wait for CPUs which did not observe the flag to
> +	 * leave and make sure their store to the fd table got published.
> +	 *
> +	 * We do it with synchronize_rcu(), which both waits for all sections to
> +	 * finish (taking care of the first part) and guarantees all CPUs issued a
> +	 * full fence (taking care of the second part).
> +	 *
> +	 * Note we know there is nobody to wait for if we are dealing with a
> +	 * single-threaded process.
>  	 */
>  	if (atomic_read(&files->count) > 1)
>  		synchronize_rcu();
>
>  	spin_lock(&files->file_lock);
> -	if (IS_ERR(new_fdt))
> -		return PTR_ERR(new_fdt);
> +	if (IS_ERR(new_fdt)) {
> +		err = PTR_ERR(new_fdt);
> +		goto out;
> +	}
>  	cur_fdt = files_fdtable(files);
>  	BUG_ON(nr < cur_fdt->max_fds);
>  	copy_fdtable(new_fdt, cur_fdt);
>  	rcu_assign_pointer(files->fdt, new_fdt);
>  	if (cur_fdt != &files->fdtab)
>  		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
> -	/* coupled with smp_rmb() in fd_install() */
> +
> +	/*
> +	 * Publish everything before we unset ->resize_in_progress, see above
> +	 * for an explanation.
> +	 */
>  	smp_wmb();
> -	return 0;
> +out:
> +	files->resize_in_progress = false;
> +	return err;
>  }
>
>  /*
> @@ -290,9 +316,7 @@ static int expand_files(struct files_struct *files, unsigned int nr)
>  		return -EMFILE;
>
>  	/* All good, so we try */
> -	files->resize_in_progress = true;
>  	error = expand_fdtable(files, nr);
> -	files->resize_in_progress = false;
>
>  	wake_up_all(&files->resize_wait);
>  	return error;
> @@ -629,13 +653,18 @@ EXPORT_SYMBOL(put_unused_fd);
>
>  void fd_install(unsigned int fd, struct file *file)
>  {
> -	struct files_struct *files = current->files;
> +	struct files_struct *files;
>  	struct fdtable *fdt;
>
>  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
>  		return;
>
> +	/*
> +	 * Synchronized with expand_fdtable(), see that routine for an
> +	 * explanation.
> +	 */
>  	rcu_read_lock_sched();
> +	files = READ_ONCE(current->files);
>
>  	if (unlikely(files->resize_in_progress)) {
>  		rcu_read_unlock_sched();
> @@ -646,8 +675,7 @@ void fd_install(unsigned int fd, struct file *file)
>  		spin_unlock(&files->file_lock);
>  		return;
>  	}
> -	/* coupled with smp_wmb() in expand_fdtable() */
> -	smp_rmb();
> +
>  	fdt = rcu_dereference_sched(files->fdt);
>  	BUG_ON(fdt->fd[fd] != NULL);
>  	rcu_assign_pointer(fdt->fd[fd], file);
> --
> 2.43.0
>
On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> See the added commentary for reasoning.
>
> ->resize_in_progress handling is moved inside of expand_fdtable() for
> clarity.
>
> Whacks an actual fence on arm64.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>
> To my reading of commentary above synchronize_rcu() this works fine(tm)
> and there is even other code relying on the same idea (percpu rwsems
> (see percpu_down_read for example), maybe there is more).
>
> However, given that barriers like to be tricky and I know about squat of
> RCU internals, I refer to Paul here.
>
> Paul, does this work? If not, any trivial tweaks to make it so?
>
> I mean smp_rmb looks dodgeable, at worst I made a mistake somewhere and
> the specific patch does not work.
>
>  fs/file.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 39 insertions(+), 11 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 019fb9acf91b..d065a24980da 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -233,28 +233,54 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
>  	__acquires(files->file_lock)
>  {
>  	struct fdtable *new_fdt, *cur_fdt;
> +	int err = 0;
>
> +	BUG_ON(files->resize_in_progress);

I think this BUG_ON() here is a bit unnecessary.

> +	files->resize_in_progress = true;

Minor: Why move that into expand_fdtable()? Having
files->resize_in_progress in here doesn't add much more clarity than
just having it set around expand_fdtable() in the caller.

Leaving it there also makes the patch smaller and clearer to follow as
you neither need the additional err nor the goto.

>  	spin_unlock(&files->file_lock);
>  	new_fdt = alloc_fdtable(nr + 1);
>
> -	/* make sure all fd_install() have seen resize_in_progress
> -	 * or have finished their rcu_read_lock_sched() section.
> +	/*
> +	 * Synchronize against the lockless fd_install().
> +	 *
> +	 * All work in that routine is enclosed with RCU sched section.
> +	 *
> +	 * We published ->resize_in_progress = true with the unlock above,
> +	 * which makes new arrivals bail to locked operation.
> +	 *
> +	 * Now we only need to wait for CPUs which did not observe the flag to
> +	 * leave and make sure their store to the fd table got published.
> +	 *
> +	 * We do it with synchronize_rcu(), which both waits for all sections to
> +	 * finish (taking care of the first part) and guarantees all CPUs issued a
> +	 * full fence (taking care of the second part).
> +	 *
> +	 * Note we know there is nobody to wait for if we are dealing with a
> +	 * single-threaded process.
>  	 */
>  	if (atomic_read(&files->count) > 1)
>  		synchronize_rcu();
>
>  	spin_lock(&files->file_lock);
> -	if (IS_ERR(new_fdt))
> -		return PTR_ERR(new_fdt);
> +	if (IS_ERR(new_fdt)) {
> +		err = PTR_ERR(new_fdt);
> +		goto out;
> +	}
>  	cur_fdt = files_fdtable(files);
>  	BUG_ON(nr < cur_fdt->max_fds);
>  	copy_fdtable(new_fdt, cur_fdt);
>  	rcu_assign_pointer(files->fdt, new_fdt);
>  	if (cur_fdt != &files->fdtab)
>  		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
> -	/* coupled with smp_rmb() in fd_install() */
> +
> +	/*
> +	 * Publish everything before we unset ->resize_in_progress, see above
> +	 * for an explanation.
> +	 */
>  	smp_wmb();
> -	return 0;
> +out:
> +	files->resize_in_progress = false;
> +	return err;
>  }
>
>  /*
> @@ -290,9 +316,7 @@ static int expand_files(struct files_struct *files, unsigned int nr)
>  		return -EMFILE;
>
>  	/* All good, so we try */
> -	files->resize_in_progress = true;
>  	error = expand_fdtable(files, nr);
> -	files->resize_in_progress = false;
>
>  	wake_up_all(&files->resize_wait);
>  	return error;
> @@ -629,13 +653,18 @@ EXPORT_SYMBOL(put_unused_fd);
>
>  void fd_install(unsigned int fd, struct file *file)
>  {
> -	struct files_struct *files = current->files;
> +	struct files_struct *files;
>  	struct fdtable *fdt;
>
>  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
>  		return;
>
> +	/*
> +	 * Synchronized with expand_fdtable(), see that routine for an
> +	 * explanation.
> +	 */
>  	rcu_read_lock_sched();
> +	files = READ_ONCE(current->files);
>
>  	if (unlikely(files->resize_in_progress)) {
>  		rcu_read_unlock_sched();
> @@ -646,8 +675,7 @@ void fd_install(unsigned int fd, struct file *file)
>  		spin_unlock(&files->file_lock);
>  		return;
>  	}
> -	/* coupled with smp_wmb() in expand_fdtable() */
> -	smp_rmb();
> +
>  	fdt = rcu_dereference_sched(files->fdt);
>  	BUG_ON(fdt->fd[fd] != NULL);
>  	rcu_assign_pointer(fdt->fd[fd], file);
> --
> 2.43.0
>
On Thu, Dec 5, 2024 at 3:46 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 05-12-24 13:03:32, Mateusz Guzik wrote:
> > See the added commentary for reasoning.
> >
> > ->resize_in_progress handling is moved inside of expand_fdtable() for
> > clarity.
> >
> > Whacks an actual fence on arm64.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
>
> Hum, I don't think this works. What could happen now is:
>
> CPU1						CPU2
> expand_fdtable()				fd_install()
>   files->resize_in_progress = true;
>   ...
>   if (atomic_read(&files->count) > 1)
>     synchronize_rcu();
>   ...
>   rcu_assign_pointer(files->fdt, new_fdt);
>   if (cur_fdt != &files->fdtab)
>     call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
>						rcu_read_lock_sched()
>						fdt = rcu_dereference_sched(files->fdt);
>						  /* Fetched old FD table - without
>						   * smp_rmb() the read was reordered */
>   rcu_assign_pointer(files->fdt, new_fdt);
>   /*
>    * Publish everything before we unset ->resize_in_progress,
>    * see above for an explanation.
>    */
>   smp_wmb();
> out:
>   files->resize_in_progress = false;
>						if (unlikely(files->resize_in_progress)) {
>						  - false
>						rcu_assign_pointer(fdt->fd[fd], file);
>						  - store in the old table - boom.

I don't believe this ordering is possible because of both
synchronize_rcu and the fence before updating resize_in_progress.

Any CPU which could try racing like that had to come in after the
synchronize_rcu() call, meaning one of the 3 possibilities:
- the flag is true and the fd table is old
- the flag is true and the fd table is new
- the flag is false and the fd table is new

Suppose the CPU reordered loads of the flag and the fd table. There is
no ordering in which it can see both the old table and the unset flag.

--
Mateusz Guzik <mjguzik gmail.com>
On Thu, Dec 5, 2024 at 3:58 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > +	BUG_ON(files->resize_in_progress);
>
> I think this BUG_ON() here is a bit unnecessary.

I'm an assert-friendly guy, but I'm not going to argue about this one.

> > +	files->resize_in_progress = true;
>
> Minor: Why move that into expand_fdtable()? Having
> files->resize_in_progress in here doesn't add much more clarity than
> just having it set around expand_fdtable() in the caller.
>
> Leaving it there also makes the patch smaller and clearer to follow as
> you neither need the additional err nor the goto.

That is indeed not the best looking goto in the world, but per the
commit message I moved the flag in so that it can be easier referred to
when describing synchronisation against fd_install.
On Thu 05-12-24 16:01:07, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 3:46 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 05-12-24 13:03:32, Mateusz Guzik wrote:
> > > See the added commentary for reasoning.
> > >
> > > ->resize_in_progress handling is moved inside of expand_fdtable() for
> > > clarity.
> > >
> > > Whacks an actual fence on arm64.
> > >
> > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> >
> > Hum, I don't think this works. What could happen now is:
> >
> > CPU1						CPU2
> > expand_fdtable()				fd_install()
> >   files->resize_in_progress = true;
> >   ...
> >   if (atomic_read(&files->count) > 1)
> >     synchronize_rcu();
> >   ...
> >   rcu_assign_pointer(files->fdt, new_fdt);
> >   if (cur_fdt != &files->fdtab)
> >     call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
> >						rcu_read_lock_sched()
> >						fdt = rcu_dereference_sched(files->fdt);
> >						  /* Fetched old FD table - without
> >						   * smp_rmb() the read was reordered */
> >   rcu_assign_pointer(files->fdt, new_fdt);
> >   /*
> >    * Publish everything before we unset ->resize_in_progress,
> >    * see above for an explanation.
> >    */
> >   smp_wmb();
> > out:
> >   files->resize_in_progress = false;
> >						if (unlikely(files->resize_in_progress)) {
> >						  - false
> >						rcu_assign_pointer(fdt->fd[fd], file);
> >						  - store in the old table - boom.
>
> I don't believe this ordering is possible because of both
> synchronize_rcu and the fence before updating resize_in_progress.
>
> Any CPU which could try racing like that had to come in after the
> synchronize_rcu() call, meaning one of the 3 possibilities:
> - the flag is true and the fd table is old
> - the flag is true and the fd table is new
> - the flag is false and the fd table is new

I agree here.

> Suppose the CPU reordered loads of the flag and the fd table. There is
> no ordering in which it can see both the old table and the unset flag.

But I disagree here. If the reads are reordered, then the fd table read can
happen during the "flag is true and the fd table is old" state and the flag
read can happen later in "flag is false and the fd table is new" state.
Just as I outlined above...

								Honza
On Thu, Dec 5, 2024 at 4:29 PM Jan Kara <jack@suse.cz> wrote:
> On Thu 05-12-24 16:01:07, Mateusz Guzik wrote:
> > Suppose the CPU reordered loads of the flag and the fd table. There is
> > no ordering in which it can see both the old table and the unset flag.
>
> But I disagree here. If the reads are reordered, then the fd table read can
> happen during the "flag is true and the fd table is old" state and the flag
> read can happen later in "flag is false and the fd table is new" state.
> Just as I outlined above...

In your example all the work happens *after* synchronize_rcu(). The
thread resizing the table already published the result even before
calling into it.

Furthermore by the time synchronize_rcu returns everyone is guaranteed
to have issued a full fence. Meaning nobody can see the flag as unset.
On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > >  void fd_install(unsigned int fd, struct file *file)
> > >  {
> > > -	struct files_struct *files = current->files;
> > > +	struct files_struct *files;
> > >  	struct fdtable *fdt;
> > >
> > >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > >  		return;
> > >
> > > +	/*
> > > +	 * Synchronized with expand_fdtable(), see that routine for an
> > > +	 * explanation.
> > > +	 */
> > >  	rcu_read_lock_sched();
> > > +	files = READ_ONCE(current->files);
> >
> > What are you trying to do with that READ_ONCE()?  current->files
> > itself is *not* changed by any of that code; current->files->fdtab is.
>
> To my understanding this is the idiomatic way of spelling out the
> non-existent in Linux smp_consume_load, for the resize_in_progress
> flag.

In Linus, "smp_consume_load()" is named rcu_dereference().

> Anyway to elaborate I'm gunning for a setup where the code is
> semantically equivalent to having a lock around the work.

Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
only with later RCU grace periods, such as those implemented by
synchronize_rcu().

> Pretend ->resize_lock exists, then:
> fd_install:
> 	files = current->files;
> 	read_lock(files->resize_lock);
> 	fdt = rcu_dereference_sched(files->fdt);
> 	rcu_assign_pointer(fdt->fd[fd], file);
> 	read_unlock(files->resize_lock);
>
> expand_fdtable:
> 	write_lock(files->resize_lock);
> 	[snip]
> 	rcu_assign_pointer(files->fdt, new_fdt);
> 	write_unlock(files->resize_lock);
>
> Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> synchronize_rcu do it.

OK, good, you did get the grace-period part of the puzzle.

However, please keep in mind that synchronize_rcu() has significant
latency by design.  There is a tradeoff between CPU consumption and
latency, and synchronize_rcu() therefore has latencies ranging upwards of
several milliseconds (not microseconds or nanoseconds).  I would be very
surprised if expand_fdtable() users would be happy with such a long delay.

Or are you using some trick to hide this delay?
On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > >  void fd_install(unsigned int fd, struct file *file)
> > > >  {
> > > > -	struct files_struct *files = current->files;
> > > > +	struct files_struct *files;
> > > >  	struct fdtable *fdt;
> > > >
> > > >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > >  		return;
> > > >
> > > > +	/*
> > > > +	 * Synchronized with expand_fdtable(), see that routine for an
> > > > +	 * explanation.
> > > > +	 */
> > > >  	rcu_read_lock_sched();
> > > > +	files = READ_ONCE(current->files);
> > >
> > > What are you trying to do with that READ_ONCE()?  current->files
> > > itself is *not* changed by any of that code; current->files->fdtab is.
> >
> > To my understanding this is the idiomatic way of spelling out the
> > non-existent in Linux smp_consume_load, for the resize_in_progress
> > flag.
>
> In Linus, "smp_consume_load()" is named rcu_dereference().

ok

> > Anyway to elaborate I'm gunning for a setup where the code is
> > semantically equivalent to having a lock around the work.
>
> Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> only with later RCU grace periods, such as those implemented by
> synchronize_rcu().

To my understanding the pre-case is already with the flag set upfront
and waiting for everyone to finish (which is already taking place in
stock code) + looking at it within the section.

> > Pretend ->resize_lock exists, then:
> > fd_install:
> > 	files = current->files;
> > 	read_lock(files->resize_lock);
> > 	fdt = rcu_dereference_sched(files->fdt);
> > 	rcu_assign_pointer(fdt->fd[fd], file);
> > 	read_unlock(files->resize_lock);
> >
> > expand_fdtable:
> > 	write_lock(files->resize_lock);
> > 	[snip]
> > 	rcu_assign_pointer(files->fdt, new_fdt);
> > 	write_unlock(files->resize_lock);
> >
> > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > synchronize_rcu do it.
>
> OK, good, you did get the grace-period part of the puzzle.
>
> However, please keep in mind that synchronize_rcu() has significant
> latency by design.  There is a tradeoff between CPU consumption and
> latency, and synchronize_rcu() therefore has latencies ranging upwards of
> several milliseconds (not microseconds or nanoseconds).  I would be very
> surprised if expand_fdtable() users would be happy with such a long delay.

The call is already there since 2015 and I only know of one case where
someone took an issue with it (and it could have been sorted out with
dup2 upfront to grow the table to the desired size). Amusingly I see
you patched it in 2018 from synchronize_sched to synchronize_rcu.
Bottom line though is that I'm not *adding* latency here. :)

So assuming the above can be ignored, do you confirm the patch works
(even if it needs some cosmetic changes)?

The entirety of the patch is about removing smp_rmb in fd_install with
small code rearrangement, while relying on the machinery which is
already there.
On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> > To my understanding this is the idiomatic way of spelling out the
> > non-existent in Linux smp_consume_load, for the resize_in_progress
> > flag.
>
> In Linus, "smp_consume_load()" is named rcu_dereference().

Linux.

But yes and no.

It's worth making it really really clear that "rcu_dereference()" is
*not* just a different name for some "smp_consume_load()" operation.

Why? Because a true smp_consume_load() would work with any random kind
of flags etc. And rcu_dereference() works only because it's a pointer,
and there's an inherent data dependency to what the result points to.

Paul obviously knows this, but let's make it very clear in this
discussion, because if somebody decided "I want a smp_consume_load(),
and I'll use rcu_dereference() to do that", the end result would
simply not work for arbitrary data, like a flags field or something,
where comparing it against a value will only result in a control
dependency, not an actual data dependency.

            Linus
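[Editorial note: a minimal sketch of the distinction Linus is drawing;
the names (gp, flag, x) are illustrative, not from the kernel:

	/*
	 * Data dependency: the second load takes its address from the
	 * first, so rcu_dereference() orders the two.
	 */
	struct foo *p = rcu_dereference(gp);
	int a = p->a;		/* cannot execute before p is known */

	/*
	 * Control dependency only: the compared value carries no
	 * address, so a later *load* may still be speculated early.
	 */
	if (READ_ONCE(flag))
		b = READ_ONCE(x);	/* may be hoisted before the flag
					 * load; a store here would be
					 * ordered, a load is not */

Per Documentation/memory-barriers.txt, a control dependency orders a
prior load only against later stores, never against later loads — which
is exactly why fd_install() has the smp_rmb() under discussion.]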
On Thu, Dec 5, 2024 at 8:26 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > > To my understanding this is the idiomatic way of spelling out the
> > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > flag.
> >
> > In Linus, "smp_consume_load()" is named rcu_dereference().
>
> Linux.
>
> But yes and no.
>
> It's worth making it really really clear that "rcu_dereference()" is
> *not* just a different name for some "smp_consume_load()" operation.
>
> Why? Because a true smp_consume_load() would work with any random kind
> of flags etc. And rcu_dereference() works only because it's a pointer,
> and there's an inherent data dependency to what the result points to.
>
> Paul obviously knows this, but let's make it very clear in this
> discussion, because if somebody decided "I want a smp_consume_load(),
> and I'll use rcu_dereference() to do that", the end result would
> simply not work for arbitrary data, like a flags field or something,
> where comparing it against a value will only result in a control
> dependency, not an actual data dependency.

So I checked for kicks and rcu_dereference comes with type checking,
as in passing something which is not a pointer even fails to compile.

I'll note though that a smp_load_consume_ptr or similarly named
routine would be nice and I'm rather confused why it was not added
given smp_load_acquire and smp_store_release being there.

One immediate user would be mnt_idmap(), like so:

diff --git a/include/linux/mount.h b/include/linux/mount.h
index 33f17b6e8732..4d3486ff67ed 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -76,7 +76,7 @@ struct vfsmount {
 static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
 {
 	/* Pairs with smp_store_release() in do_idmap_mount(). */
-	return READ_ONCE(mnt->mnt_idmap);
+	return smp_load_consume_ptr(mnt->mnt_idmap);
 }

 extern int mnt_want_write(struct vfsmount *mnt);
On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > >  void fd_install(unsigned int fd, struct file *file)
> > > > >  {
> > > > > -	struct files_struct *files = current->files;
> > > > > +	struct files_struct *files;
> > > > >  	struct fdtable *fdt;
> > > > >
> > > > >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > >  		return;
> > > > >
> > > > > +	/*
> > > > > +	 * Synchronized with expand_fdtable(), see that routine for an
> > > > > +	 * explanation.
> > > > > +	 */
> > > > >  	rcu_read_lock_sched();
> > > > > +	files = READ_ONCE(current->files);
> > > >
> > > > What are you trying to do with that READ_ONCE()?  current->files
> > > > itself is *not* changed by any of that code; current->files->fdtab is.
> > >
> > > To my understanding this is the idiomatic way of spelling out the
> > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > flag.
> >
> > In Linus, "smp_consume_load()" is named rcu_dereference().
>
> ok

And rcu_dereference(), and for that matter memory_order_consume, only
orders dereferences following the load of the pointer against
dereferences of that same pointer preceding the store of that pointer.

	T1				T2
	a: p->a = 1;			d: q = rcu_dereference(gp);
	b: r1 = p->b;			e: r2 = q->a;
	c: rcu_assign_pointer(gp, p);	f: q->b = 42;

Here, if (and only if!) T2's load into q gets the value stored by
T1, then T2's statements e and f are guaranteed to happen after T1's
statements a and b.  In your patch, I do not see this pattern for the
files->resize_in_progress flag.

> > > Anyway to elaborate I'm gunning for a setup where the code is
> > > semantically equivalent to having a lock around the work.
> >
> > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > only with later RCU grace periods, such as those implemented by
> > synchronize_rcu().
>
> To my understanding the pre-case is already with the flag set upfront
> and waiting for everyone to finish (which is already taking place in
> stock code) + looking at it within the section.

I freely confess that I do not understand the purpose of assigning to
files->resize_in_progress both before (pre-existing) and within (added)
expand_fdtable().  If the assignments before and after the call to
expand_fdtable() and the checks were under that lock, that could work,
but removing that lockless check might have performance and scalability
consequences.

> > > Pretend ->resize_lock exists, then:
> > > fd_install:
> > > 	files = current->files;
> > > 	read_lock(files->resize_lock);
> > > 	fdt = rcu_dereference_sched(files->fdt);
> > > 	rcu_assign_pointer(fdt->fd[fd], file);
> > > 	read_unlock(files->resize_lock);
> > >
> > > expand_fdtable:
> > > 	write_lock(files->resize_lock);
> > > 	[snip]
> > > 	rcu_assign_pointer(files->fdt, new_fdt);
> > > 	write_unlock(files->resize_lock);
> > >
> > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > synchronize_rcu do it.
> >
> > OK, good, you did get the grace-period part of the puzzle.
> >
> > However, please keep in mind that synchronize_rcu() has significant
> > latency by design.  There is a tradeoff between CPU consumption and
> > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > several milliseconds (not microseconds or nanoseconds).  I would be very
> > surprised if expand_fdtable() users would be happy with such a long delay.
>
> The call is already there since 2015 and I only know of one case where
> someone took an issue with it (and it could have been sorted out with
> dup2 upfront to grow the table to the desired size). Amusingly I see
> you patched it in 2018 from synchronize_sched to synchronize_rcu.
> Bottom line though is that I'm not *adding* latency here. :)

Are you saying that the smp_rmb() is unnecessary?  It doesn't seem like
you are saying that, because otherwise your patch could simply remove
it without additional code changes.  On the other hand, if it is a key
component of the synchronization, I don't see how that smp_rmb() can be
removed while still preserving that synchronization without adding another
synchronize_rcu() to that function to compensate.

Now, it might be that you are somehow cleverly reusing the pre-existing
synchronize_rcu(), but I am not immediately seeing how this would work.

And no, I do not recall making that particular change back in the
day, only that I did change all the calls to synchronize_sched() to
synchronize_rcu().  Please accept my apologies for my having failed
to meet your expectations.  And do not be too surprised if others have
similar expectations of you at some point in the future.  ;-)

> So assuming the above can be ignored, do you confirm the patch works
> (even if it needs some cosmetic changes)?
>
> The entirety of the patch is about removing smp_rmb in fd_install with
> small code rearrangement, while relying on the machinery which is
> already there.

The code to be synchronized is fairly small.  So why don't you
create a litmus test and ask herd7?  Please see tools/memory-model for
documentation and other example litmus tests.  This tool does the moral
equivalent of a full state-space search of the litmus tests, telling you
whether your "exists" condition is always, sometimes, or never satisfied.

							Thanx, Paul
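[Editorial note: the workflow Paul is pointing at lives in the kernel
tree. Assuming a litmus test saved as fd-install.litmus, a run looks
like:

	$ cd tools/memory-model
	$ herd7 -conf linux-kernel.cfg fd-install.litmus

The output lists the reachable final states and ends with an
"Observation ... Always/Sometimes/Never" line saying whether the
test's exists clause can be satisfied; see tools/memory-model/README
for installing herd7 and for worked examples.]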
On Thu, Dec 05, 2024 at 11:26:35AM -0800, Linus Torvalds wrote:
> On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > > To my understanding this is the idiomatic way of spelling out the
> > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > flag.
> >
> > In Linus, "smp_consume_load()" is named rcu_dereference().
>
> Linux.

One of those days...  ;-)

> But yes and no.
>
> It's worth making it really really clear that "rcu_dereference()" is
> *not* just a different name for some "smp_consume_load()" operation.
>
> Why? Because a true smp_consume_load() would work with any random kind
> of flags etc. And rcu_dereference() works only because it's a pointer,
> and there's an inherent data dependency to what the result points to.
>
> Paul obviously knows this, but let's make it very clear in this
> discussion, because if somebody decided "I want a smp_consume_load(),
> and I'll use rcu_dereference() to do that", the end result would
> simply not work for arbitrary data, like a flags field or something,
> where comparing it against a value will only result in a control
> dependency, not an actual data dependency.

Fair points!  And Linus (and Linux, for that matter) equally obviously
already knows this, but please note also that an smp_load_consume()
would still order only later dereferences of the thing returned from
smp_load_consume(), which means that it pretty much needs to be a
pointer.  (Yes, in theory, it could be an array index, but in practice
compilers know way too much about integer arithmetic for this to be
advisable.)

							Thanx, Paul
On Thu, Dec 05, 2024 at 08:47:24PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 8:26 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > > To my understanding this is the idiomatic way of spelling out the
> > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > flag.
> > >
> > > In Linus, "smp_consume_load()" is named rcu_dereference().
> >
> > Linux.
> >
> > But yes and no.
> >
> > It's worth making it really really clear that "rcu_dereference()" is
> > *not* just a different name for some "smp_consume_load()" operation.
> >
> > Why? Because a true smp_consume_load() would work with any random kind
> > of flags etc. And rcu_dereference() works only because it's a pointer,
> > and there's an inherent data dependency to what the result points to.
> >
> > Paul obviously knows this, but let's make it very clear in this
> > discussion, because if somebody decided "I want a smp_consume_load(),
> > and I'll use rcu_dereference() to do that", the end result would
> > simply not work for arbitrary data, like a flags field or something,
> > where comparing it against a value will only result in a control
> > dependency, not an actual data dependency.
>
> So I checked for kicks and rcu_dereference comes with type checking,
> as in passing something which is not a pointer even fails to compile.

That is by design, keeping in mind that consume loads order only
later dereferences against the pointer load.

> I'll note though that a smp_load_consume_ptr or similarly named
> routine would be nice and I'm rather confused why it was not added
> given smp_load_acquire and smp_store_release being there.

In recent kernels, READ_ONCE() is your smp_load_consume_ptr().  Or things
like rcu_dereference_check(p, 1).  But these can be used only when the
pointed-to object is guaranteed to live (as in not be freed) for the
full duration of the read-side use of that pointer.

> One immediate user would be mnt_idmap(), like so:
>
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 33f17b6e8732..4d3486ff67ed 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -76,7 +76,7 @@ struct vfsmount {
>  static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
>  {
>  	/* Pairs with smp_store_release() in do_idmap_mount(). */
> -	return READ_ONCE(mnt->mnt_idmap);
> +	return smp_load_consume_ptr(mnt->mnt_idmap);
>  }
>
>  extern int mnt_want_write(struct vfsmount *mnt);

These would have the same semantics.  And in v6.12, this is instead
smp_load_acquire().

							Thanx, Paul
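[Editorial note: the pairing Paul refers to, sketched with the names
from the quoted diff; this is an illustration of the idiom, not the
verbatim kernel code:

	/* do_idmap_mount(), writer side (per the comment in mount.h):
	 * release-store publishes the fully initialized idmap. */
	smp_store_release(&mnt->mnt_idmap, new_idmap);

	/* mnt_idmap(), reader side, as of v6.12: acquire-load orders
	 * all later accesses after the pointer load. */
	struct mnt_idmap *idmap = smp_load_acquire(&mnt->mnt_idmap);

The acquire form is strictly stronger than what the thread calls a
consume load: READ_ONCE() already orders dereferences *through* the
returned pointer (an address dependency), while smp_load_acquire()
additionally orders every later access, dependent or not.]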
On Thu, Dec 5, 2024 at 9:01 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> > On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > >
> > > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > > >  void fd_install(unsigned int fd, struct file *file)
> > > > > >  {
> > > > > > -	struct files_struct *files = current->files;
> > > > > > +	struct files_struct *files;
> > > > > >  	struct fdtable *fdt;
> > > > > >
> > > > > >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > > >  		return;
> > > > > >
> > > > > > +	/*
> > > > > > +	 * Synchronized with expand_fdtable(), see that routine for an
> > > > > > +	 * explanation.
> > > > > > +	 */
> > > > > >  	rcu_read_lock_sched();
> > > > > > +	files = READ_ONCE(current->files);
> > > > >
> > > > > What are you trying to do with that READ_ONCE()?  current->files
> > > > > itself is *not* changed by any of that code; current->files->fdtab is.
> > > >
> > > > To my understanding this is the idiomatic way of spelling out the
> > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > flag.
> > >
> > > In Linus, "smp_consume_load()" is named rcu_dereference().
> >
> > ok
>
> And rcu_dereference(), and for that matter memory_order_consume, only
> orders dereferences following the load of the pointer against
> dereferences of that same pointer preceding the store of that pointer.
>
> 	T1				T2
> 	a: p->a = 1;			d: q = rcu_dereference(gp);
> 	b: r1 = p->b;			e: r2 = q->a;
> 	c: rcu_assign_pointer(gp, p);	f: q->b = 42;
>
> Here, if (and only if!) T2's load into q gets the value stored by
> T1, then T2's statements e and f are guaranteed to happen after T1's
> statements a and b.  In your patch, I do not see this pattern for the
> files->resize_in_progress flag.
>
> > > > Anyway to elaborate I'm gunning for a setup where the code is
> > > > semantically equivalent to having a lock around the work.
> > >
> > > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > > only with later RCU grace periods, such as those implemented by
> > > synchronize_rcu().
> >
> > To my understanding the pre-case is already with the flag set upfront
> > and waiting for everyone to finish (which is already taking place in
> > stock code) + looking at it within the section.
>
> I freely confess that I do not understand the purpose of assigning to
> files->resize_in_progress both before (pre-existing) and within (added)
> expand_fdtable().  If the assignments before and after the call to
> expand_fdtable() and the checks were under that lock, that could work,
> but removing that lockless check might have performance and scalability
> consequences.
>
> > > > Pretend ->resize_lock exists, then:
> > > > fd_install:
> > > > 	files = current->files;
> > > > 	read_lock(files->resize_lock);
> > > > 	fdt = rcu_dereference_sched(files->fdt);
> > > > 	rcu_assign_pointer(fdt->fd[fd], file);
> > > > 	read_unlock(files->resize_lock);
> > > >
> > > > expand_fdtable:
> > > > 	write_lock(files->resize_lock);
> > > > 	[snip]
> > > > 	rcu_assign_pointer(files->fdt, new_fdt);
> > > > 	write_unlock(files->resize_lock);
> > > >
> > > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > > synchronize_rcu do it.
> > >
> > > OK, good, you did get the grace-period part of the puzzle.
> > >
> > > However, please keep in mind that synchronize_rcu() has significant
> > > latency by design.  There is a tradeoff between CPU consumption and
> > > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > > several milliseconds (not microseconds or nanoseconds).  I would be very
> > > surprised if expand_fdtable() users would be happy with such a long delay.
> >
> > The call is already there since 2015 and I only know of one case where
> > someone took an issue with it (and it could have been sorted out with
> > dup2 upfront to grow the table to the desired size). Amusingly I see
> > you patched it in 2018 from synchronize_sched to synchronize_rcu.
> > Bottom line though is that I'm not *adding* latency here. :)
>
> Are you saying that the smp_rmb() is unnecessary?  It doesn't seem like
> you are saying that, because otherwise your patch could simply remove
> it without additional code changes.  On the other hand, if it is a key
> component of the synchronization, I don't see how that smp_rmb() can be
> removed while still preserving that synchronization without adding another
> synchronize_rcu() to that function to compensate.
>
> Now, it might be that you are somehow cleverly reusing the pre-existing
> synchronize_rcu(), but I am not immediately seeing how this would work.
>
> And no, I do not recall making that particular change back in the
> day, only that I did change all the calls to synchronize_sched() to
> synchronize_rcu().  Please accept my apologies for my having failed
> to meet your expectations.  And do not be too surprised if others have
> similar expectations of you at some point in the future.  ;-)
>
> > So assuming the above can be ignored, do you confirm the patch works
> > (even if it needs some cosmetic changes)?
> >
> > The entirety of the patch is about removing smp_rmb in fd_install with
> > small code rearrangement, while relying on the machinery which is
> > already there.
>
> The code to be synchronized is fairly small.  So why don't you
> create a litmus test and ask herd7?  Please see tools/memory-model for
> documentation and other example litmus tests.  This tool does the moral
> equivalent of a full state-space search of the litmus tests, telling you
> whether your "exists" condition is always, sometimes, or never satisfied.

I think there is quite a degree of talking past each other in this thread.

I was not aware of herd7. Testing the thing with it sounds like a plan
to get out of it, so I'm going to do it and get back to you in a day
or two. Worst case the patch is a bust, best case the fence is already
of no use.
On Thu, Dec 05, 2024 at 09:15:14PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 9:01 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> > > On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > > >
> > > > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > > > >  void fd_install(unsigned int fd, struct file *file)
> > > > > > >  {
> > > > > > > -	struct files_struct *files = current->files;
> > > > > > > +	struct files_struct *files;
> > > > > > >  	struct fdtable *fdt;
> > > > > > >
> > > > > > >  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > > > >  		return;
> > > > > > >
> > > > > > > +	/*
> > > > > > > +	 * Synchronized with expand_fdtable(), see that routine for an
> > > > > > > +	 * explanation.
> > > > > > > +	 */
> > > > > > >  	rcu_read_lock_sched();
> > > > > > > +	files = READ_ONCE(current->files);
> > > > > >
> > > > > > What are you trying to do with that READ_ONCE()?  current->files
> > > > > > itself is *not* changed by any of that code; current->files->fdtab is.
> > > > >
> > > > > To my understanding this is the idiomatic way of spelling out the
> > > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > > flag.
> > > >
> > > > In Linus, "smp_consume_load()" is named rcu_dereference().
> > >
> > > ok
> >
> > And rcu_dereference(), and for that matter memory_order_consume, only
> > orders dereferences following the load of the pointer against
> > dereferences of that same pointer preceding the store of that pointer.
> >
> > 	T1				T2
> > 	a: p->a = 1;			d: q = rcu_dereference(gp);
> > 	b: r1 = p->b;			e: r2 = q->a;
> > 	c: rcu_assign_pointer(gp, p);	f: q->b = 42;
> >
> > Here, if (and only if!) T2's load into q gets the value stored by
> > T1, then T2's statements e and f are guaranteed to happen after T1's
> > statements a and b.  In your patch, I do not see this pattern for the
> > files->resize_in_progress flag.
> >
> > > > > Anyway to elaborate I'm gunning for a setup where the code is
> > > > > semantically equivalent to having a lock around the work.
> > > >
> > > > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > > > only with later RCU grace periods, such as those implemented by
> > > > synchronize_rcu().
> > >
> > > To my understanding the pre-case is already with the flag set upfront
> > > and waiting for everyone to finish (which is already taking place in
> > > stock code) + looking at it within the section.
> >
> > I freely confess that I do not understand the purpose of assigning to
> > files->resize_in_progress both before (pre-existing) and within (added)
> > expand_fdtable().  If the assignments before and after the call to
> > expand_fdtable() and the checks were under that lock, that could work,
> > but removing that lockless check might have performance and scalability
> > consequences.
> >
> > > > > Pretend ->resize_lock exists, then:
> > > > > fd_install:
> > > > > 	files = current->files;
> > > > > 	read_lock(files->resize_lock);
> > > > > 	fdt = rcu_dereference_sched(files->fdt);
> > > > > 	rcu_assign_pointer(fdt->fd[fd], file);
> > > > > 	read_unlock(files->resize_lock);
> > > > >
> > > > > expand_fdtable:
> > > > > 	write_lock(files->resize_lock);
> > > > > 	[snip]
> > > > > 	rcu_assign_pointer(files->fdt, new_fdt);
> > > > > 	write_unlock(files->resize_lock);
> > > > >
> > > > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > > > synchronize_rcu do it.
> > > >
> > > > OK, good, you did get the grace-period part of the puzzle.
> > > >
> > > > However, please keep in mind that synchronize_rcu() has significant
> > > > latency by design.  There is a tradeoff between CPU consumption and
> > > > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > > > several milliseconds (not microseconds or nanoseconds).  I would be very
> > > > surprised if expand_fdtable() users would be happy with such a long delay.
> > >
> > > The call is already there since 2015 and I only know of one case where
> > > someone took an issue with it (and it could have been sorted out with
> > > dup2 upfront to grow the table to the desired size). Amusingly I see
> > > you patched it in 2018 from synchronize_sched to synchronize_rcu.
> > > Bottom line though is that I'm not *adding* latency here. :)
> >
> > Are you saying that the smp_rmb() is unnecessary?  It doesn't seem like
> > you are saying that, because otherwise your patch could simply remove
> > it without additional code changes.  On the other hand, if it is a key
> > component of the synchronization, I don't see how that smp_rmb() can be
> > removed while still preserving that synchronization without adding another
> > synchronize_rcu() to that function to compensate.
> >
> > Now, it might be that you are somehow cleverly reusing the pre-existing
> > synchronize_rcu(), but I am not immediately seeing how this would work.
> >
> > And no, I do not recall making that particular change back in the
> > day, only that I did change all the calls to synchronize_sched() to
> > synchronize_rcu().  Please accept my apologies for my having failed
> > to meet your expectations.  And do not be too surprised if others have
> > similar expectations of you at some point in the future.  ;-)
> >
> > > So assuming the above can be ignored, do you confirm the patch works
> > > (even if it needs some cosmetic changes)?
> > >
> > > The entirety of the patch is about removing smp_rmb in fd_install with
> > > small code rearrangement, while relying on the machinery which is
> > > already there.
> >
> > The code to be synchronized is fairly small.  So why don't you
> > create a litmus test and ask herd7?  Please see tools/memory-model for
> > documentation and other example litmus tests.  This tool does the moral
> > equivalent of a full state-space search of the litmus tests, telling you
> > whether your "exists" condition is always, sometimes, or never satisfied.
>
> I think there is quite a degree of talking past each other in this thread.
>
> I was not aware of herd7. Testing the thing with it sounds like a plan
> to get out of it, so I'm going to do it and get back to you in a day
> or two. Worst case the patch is a bust, best case the fence is already
> of no use.

Very good!  My grouchiness earlier in this thread notwithstanding, I am
happy to review your litmus tests.  (You will likely need more than one.)

And the inevitable unsolicited advice:  Make those litmus tests as small
as you possibly can.  Full-state-space search is extremely powerful, but
it does not scale very well.

							Thanx, Paul
On Thu, Dec 05, 2024 at 12:11:57PM -0800, Paul E. McKenney wrote:
> On Thu, Dec 05, 2024 at 08:47:24PM +0100, Mateusz Guzik wrote:
> > On Thu, Dec 5, 2024 at 8:26 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > > To my understanding this is the idiomatic way of spelling out the
> > > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > > flag.
> > > >
> > > > In Linus, "smp_consume_load()" is named rcu_dereference().
> > >
> > > Linux.
> > >
> > > But yes and no.
> > >
> > > It's worth making it really really clear that "rcu_dereference()" is
> > > *not* just a different name for some "smp_consume_load()" operation.
> > >
> > > Why? Because a true smp_consume_load() would work with any random kind
> > > of flags etc. And rcu_dereference() works only because it's a pointer,
> > > and there's an inherent data dependency to what the result points to.
> > >
> > > Paul obviously knows this, but let's make it very clear in this
> > > discussion, because if somebody decided "I want a smp_consume_load(),
> > > and I'll use rcu_dereference() to do that", the end result would
> > > simply not work for arbitrary data, like a flags field or something,
> > > where comparing it against a value will only result in a control
> > > dependency, not an actual data dependency.
> >
> > So I checked for kicks and rcu_dereference comes with type checking,
> > as in passing something which is not a pointer even fails to compile.
>
> That is by design, keeping in mind that consume loads order only
> later dereferences against the pointer load.
>
> > I'll note though that a smp_load_consume_ptr or similarly named
> > routine would be nice and I'm rather confused why it was not added
> > given smp_load_acquire and smp_store_release being there.
>
> In recent kernels, READ_ONCE() is your smp_load_consume_ptr().  Or things
> like rcu_dereference_check(p, 1).  But these can be used only when the
> pointed-to object is guaranteed to live (as in not be freed) for the
> full duration of the read-side use of that pointer.

Both true in the case of mnt_idmap(). All mounts start with mnt_idmap
set to:

	extern struct mnt_idmap nop_mnt_idmap

which doesn't go away ever. And we only allow to change the idmapping of
a mount once. So the READ_ONCE() will always retrieve an object that is
guaranteed to stay alive for at least as long as the mount stays alive
(in the nop_mnt_idmap case obviously "forever").

I wanted to avoid a) pushing complicated RCU dances all through
filesystems and the VFS and b) taking any reference counts whatsoever
on mnt_idmap (other than sharing the same mnt_idmap between different
mounts at creation time).

(Though I do have long-standing ideas how that would work without
changing the mnt_idmap pointer.)
On Thu 05-12-24 16:36:40, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 4:29 PM Jan Kara <jack@suse.cz> wrote:
> > On Thu 05-12-24 16:01:07, Mateusz Guzik wrote:
> > > Suppose the CPU reordered loads of the flag and the fd table. There is
> > > no ordering in which it can see both the old table and the unset flag.
> >
> > But I disagree here. If the reads are reordered, then the fd table read can
> > happen during the "flag is true and the fd table is old" state and the flag
> > read can happen later in "flag is false and the fd table is new" state.
> > Just as I outlined above...

Ugh, I might be missing something obvious so please bear with me.

> In your example all the work happens *after* synchronize_rcu().

Correct.

> The thread resizing the table already published the result even before
> calling into it.

Really? You proposed expand_table() does:

	BUG_ON(files->resize_in_progress);
	files->resize_in_progress = true;
	spin_unlock(&files->file_lock);
	new_fdt = alloc_fdtable(nr + 1);
	if (atomic_read(&files->count) > 1)
		synchronize_rcu();
	spin_lock(&files->file_lock);
	if (IS_ERR(new_fdt)) {
		err = PTR_ERR(new_fdt);
		goto out;
	}
	cur_fdt = files_fdtable(files);
	BUG_ON(nr < cur_fdt->max_fds);
	copy_fdtable(new_fdt, cur_fdt);
	rcu_assign_pointer(files->fdt, new_fdt);
	if (cur_fdt != &files->fdtab)
		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
	smp_wmb();
out:
	files->resize_in_progress = false;
	return err;

So synchronize_rcu() is called very early AFAICT. At that point we have
allocated the new table but copy & store in files->fdt happens *after*
synchronize_rcu() has finished. So the copy & store can be racing with
fd_install() calling rcu_read_lock_sched() and prefetching files->fdt
(but not files->resize_in_progress!) into a local CPU cache.

> Furthermore by the time synchronize_rcu returns
> everyone is guaranteed to have issued a full fence. Meaning nobody can
> see the flag as unset.

Well, nobody can see the flag unset only until expand_fdtable() reaches:

	smp_wmb();
out:
	files->resize_in_progress = false;

And smp_wmb() doesn't give you much unless the reads of
files->resize_in_progress and files->fdt are ordered somehow on the
other side (i.e., in fd_install()).

But I'm looking forward to the Litmus test to resolve our discussion :)

								Honza
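[Editorial note: a minimal litmus test for the disputed window might
look like the sketch below. It is a hand-reduced model, not the kernel
code: "old" and "new" stand in for the two fd tables, "flag" for
->resize_in_progress, and rcu_read_lock_sched() is modeled with plain
rcu_read_lock(), which LKMM treats the same way for grace-period
purposes. Whether the exists clause is reachable is precisely what the
thread is arguing about, so the sketch only frames the question for
herd7 rather than answering it:

	C fdt-resize-vs-install

	{
		fdt=old;
		old=0;
		new=0;
	}

	P0(int **fdt, int *new, int *flag)
	{
		WRITE_ONCE(*flag, 1);		/* make new arrivals take the lock */
		synchronize_rcu();		/* wait out pre-existing readers */
		rcu_assign_pointer(*fdt, new);	/* publish the new table */
		smp_wmb();			/* order the publish... */
		WRITE_ONCE(*flag, 0);		/* ...before clearing the flag */
	}

	P1(int **fdt, int *flag)
	{
		int r0;
		int *r1;

		rcu_read_lock();
		r0 = READ_ONCE(*flag);		/* lockless fast-path check */
		r1 = rcu_dereference(*fdt);	/* no smp_rmb() between the loads */
		rcu_read_unlock();
	}

	exists (1:r0=0 /\ 1:r1=old)

P1 mirrors the patched lockless fd_install() path: if herd7 reports the
exists clause as "Sometimes", P1 can observe the flag clear yet still
use the old table, which is Jan's scenario; "Never" would support the
patch's reasoning.]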
diff --git a/fs/file.c b/fs/file.c
index 019fb9acf91b..d065a24980da 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -233,28 +233,54 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
 	__acquires(files->file_lock)
 {
 	struct fdtable *new_fdt, *cur_fdt;
+	int err = 0;
 
+	BUG_ON(files->resize_in_progress);
+	files->resize_in_progress = true;
 	spin_unlock(&files->file_lock);
 	new_fdt = alloc_fdtable(nr + 1);
 
-	/* make sure all fd_install() have seen resize_in_progress
-	 * or have finished their rcu_read_lock_sched() section.
+	/*
+	 * Synchronize against the lockless fd_install().
+	 *
+	 * All work in that routine is enclosed with RCU sched section.
+	 *
+	 * We published ->resize_in_progress = true with the unlock above,
+	 * which makes new arrivals bail to locked operation.
+	 *
+	 * Now we only need to wait for CPUs which did not observe the flag to
+	 * leave and make sure their store to the fd table got published.
+	 *
+	 * We do it with synchronize_rcu(), which both waits for all sections to
+	 * finish (taking care of the first part) and guarantees all CPUs issued a
+	 * full fence (taking care of the second part).
+	 *
+	 * Note we know there is nobody to wait for if we are dealing with a
+	 * single-threaded process.
 	 */
 	if (atomic_read(&files->count) > 1)
 		synchronize_rcu();
 
 	spin_lock(&files->file_lock);
-	if (IS_ERR(new_fdt))
-		return PTR_ERR(new_fdt);
+	if (IS_ERR(new_fdt)) {
+		err = PTR_ERR(new_fdt);
+		goto out;
+	}
 	cur_fdt = files_fdtable(files);
 	BUG_ON(nr < cur_fdt->max_fds);
 	copy_fdtable(new_fdt, cur_fdt);
 	rcu_assign_pointer(files->fdt, new_fdt);
 	if (cur_fdt != &files->fdtab)
 		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
-	/* coupled with smp_rmb() in fd_install() */
+
+	/*
+	 * Publish everything before we unset ->resize_in_progress, see above
+	 * for an explanation.
+	 */
 	smp_wmb();
-	return 0;
+out:
+	files->resize_in_progress = false;
+	return err;
 }
 
 /*
@@ -290,9 +316,7 @@ static int expand_files(struct files_struct *files, unsigned int nr)
 		return -EMFILE;
 
 	/* All good, so we try */
-	files->resize_in_progress = true;
 	error = expand_fdtable(files, nr);
-	files->resize_in_progress = false;
 
 	wake_up_all(&files->resize_wait);
 	return error;
@@ -629,13 +653,18 @@ EXPORT_SYMBOL(put_unused_fd);
 
 void fd_install(unsigned int fd, struct file *file)
 {
-	struct files_struct *files = current->files;
+	struct files_struct *files;
 	struct fdtable *fdt;
 
 	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
 		return;
 
+	/*
+	 * Synchronized with expand_fdtable(), see that routine for an
+	 * explanation.
+	 */
 	rcu_read_lock_sched();
+	files = READ_ONCE(current->files);
 
 	if (unlikely(files->resize_in_progress)) {
 		rcu_read_unlock_sched();
@@ -646,8 +675,7 @@ void fd_install(unsigned int fd, struct file *file)
 		spin_unlock(&files->file_lock);
 		return;
 	}
-	/* coupled with smp_wmb() in expand_fdtable() */
-	smp_rmb();
+
 	fdt = rcu_dereference_sched(files->fdt);
 	BUG_ON(fdt->fd[fd] != NULL);
 	rcu_assign_pointer(fdt->fd[fd], file);
See the added commentary for reasoning.

->resize_in_progress handling is moved inside of expand_fdtable() for
clarity.

Whacks an actual fence on arm64.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

To my reading of commentary above synchronize_rcu() this works fine(tm)
and there is even other code relying on the same idea (percpu rwsems
(see percpu_down_read for example), maybe there is more).

However, given that barriers like to be tricky and I know about squat of
RCU internals, I refer to Paul here.

Paul, does this work? If not, any trivial tweaks to make it so?

I mean smp_rmb looks dodgeable, at worst I made a mistake somewhere and
the specific patch does not work.

 fs/file.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 39 insertions(+), 11 deletions(-)