
[2/2] xfs: make sure link path does not go away at access

Message ID 163660197073.22525.11235124150551283676.stgit@mickey.themaw.net (mailing list archive)
State New, archived
Series [1/2] vfs: check dentry is still valid in get_link()

Commit Message

Ian Kent Nov. 11, 2021, 3:39 a.m. UTC
When following a trailing symlink in rcu-walk mode it's possible to
succeed in getting the ->get_link() method pointer but have the link
path string deallocated while it's being used.

Utilize the rcu mechanism to mitigate this risk.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/xfs/kmem.h      |    4 ++++
 fs/xfs/xfs_inode.c |    4 ++--
 fs/xfs/xfs_iops.c  |   10 ++++++++--
 3 files changed, 14 insertions(+), 4 deletions(-)

Comments

Brian Foster Nov. 11, 2021, 4:08 p.m. UTC | #1
On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote:
> When following a trailing symlink in rcu-walk mode it's possible to
> succeed in getting the ->get_link() method pointer but the link path
> string be deallocated while it's being used.
> 
> Utilize the rcu mechanism to mitigate this risk.
> 
> Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
> Signed-off-by: Ian Kent <raven@themaw.net>
> ---
>  fs/xfs/kmem.h      |    4 ++++
>  fs/xfs/xfs_inode.c |    4 ++--
>  fs/xfs/xfs_iops.c  |   10 ++++++++--
>  3 files changed, 14 insertions(+), 4 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a607d6aca5c4..2977e19da7b7 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -524,11 +524,17 @@ xfs_vn_get_link_inline(
>  
>  	/*
>  	 * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED if
> -	 * if_data is junk.
> +	 * if_data is junk. Also, if the path walk is in rcu-walk mode
> +	 * and the inode link path has gone away due inode re-use we have
> +	 * no choice but to tell the VFS to redo the lookup.
>  	 */
> -	link = ip->i_df.if_u1.if_data;
> +	link = rcu_dereference(ip->i_df.if_u1.if_data);
> +	if (!dentry && !link)
> +		return ERR_PTR(-ECHILD);
> +

One thing that concerns me slightly about this approach is that inode
reuse does not necessarily guarantee that if_data is NULL. It seems
technically just as possible (even if exceedingly unlikely) for link to
point at newly allocated memory since the previous sequence count
validation check. The inode could be reused as another inline symlink
for example, though it's not clear to me if that is really a problem for
the vfs (assuming a restart would just land on the new link anyways?).
But the inode could also be reallocated as something like a shortform
directory, which means passing directory header data or whatever it
stores in if_data back to pick_link(), which is then further processed
as a string.

With that, I wonder why we wouldn't just return -ECHILD here like we do
for the non-inline case to address the immediate problem, and then
perhaps separately consider if we can rework bits of the reuse/reclaim
code to allow rcu lookup of inline symlinks under certain conditions.
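
As an illustrative sketch only (reusing the names from the patch above;
the non-inline xfs_vn_get_link() already bails out of rcu-walk this
way), that would look something like:

        STATIC const char *
        xfs_vn_get_link_inline(
                struct dentry           *dentry,
                struct inode            *inode,
                struct delayed_call     *done)
        {
                struct xfs_inode        *ip = XFS_I(inode);
                char                    *link;

                /* Sketch: always punt rcu-walk (dentry == NULL) to ref-walk. */
                if (!dentry)
                        return ERR_PTR(-ECHILD);

                link = ip->i_df.if_u1.if_data;
                if (XFS_IS_CORRUPT(ip->i_mount, !link))
                        return ERR_PTR(-EFSCORRUPTED);
                return link;
        }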

FWIW, I'm also a little curious why we don't set i_link for inline
symlinks. I don't think that addresses this validation problem, but
it might perhaps allow rcu lookups in the inline symlink common case
where things don't change during the lookup (and maybe even eliminate
the need for this custom inline callback)?
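
For context, when inode->i_link is non-NULL the VFS uses it directly
and never calls ->get_link(), even in rcu-walk mode. A hypothetical
sketch of that idea at inode setup time (not a proposed patch; the
buffer lifetime issues discussed above would still apply):

        /*
         * Hypothetical: publish the inline symlink target so the VFS
         * can resolve the link without the custom inline ->get_link()
         * callback.
         */
        if (S_ISLNK(VFS_I(ip)->i_mode) &&
            ip->i_df.if_format == XFS_DINODE_FMT_LOCAL)
                VFS_I(ip)->i_link = ip->i_df.if_u1.if_data;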

Brian

>  	if (XFS_IS_CORRUPT(ip->i_mount, !link))
>  		return ERR_PTR(-EFSCORRUPTED);
> +
>  	return link;
>  }
>  
> 
>
Ian Kent Nov. 11, 2021, 11:10 p.m. UTC | #2
On Thu, 2021-11-11 at 11:08 -0500, Brian Foster wrote:

Hi Brian,

> On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote:
> > When following a trailing symlink in rcu-walk mode it's possible to
> > succeed in getting the ->get_link() method pointer but the link
> > path
> > string be deallocated while it's being used.
> > 
> > Utilize the rcu mechanism to mitigate this risk.
> > 
> > Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
> > Signed-off-by: Ian Kent <raven@themaw.net>
> > ---
> >  fs/xfs/kmem.h      |    4 ++++
> >  fs/xfs/xfs_inode.c |    4 ++--
> >  fs/xfs/xfs_iops.c  |   10 ++++++++--
> >  3 files changed, 14 insertions(+), 4 deletions(-)
> > 
> ...
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index a607d6aca5c4..2977e19da7b7 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -524,11 +524,17 @@ xfs_vn_get_link_inline(
> >  
> >         /*
> >          * The VFS crashes on a NULL pointer, so return -
> > EFSCORRUPTED if
> > -        * if_data is junk.
> > +        * if_data is junk. Also, if the path walk is in rcu-walk
> > mode
> > +        * and the inode link path has gone away due inode re-use
> > we have
> > +        * no choice but to tell the VFS to redo the lookup.
> >          */
> > -       link = ip->i_df.if_u1.if_data;
> > +       link = rcu_dereference(ip->i_df.if_u1.if_data);
> > +       if (!dentry && !link)
> > +               return ERR_PTR(-ECHILD);
> > +
> 
> One thing that concerns me slightly about this approach is that inode
> reuse does not necessarily guarantee that if_data is NULL. It seems
> technically just as possible (even if exceedingly unlikely) for link
> to
> point at newly allocated memory since the previous sequence count
> validation check. The inode could be reused as another inline symlink
> for example, though it's not clear to me if that is really a problem
> for
> the vfs (assuming a restart would just land on the new link
> anyways?).
> But the inode could also be reallocated as something like a shortform
> directory, which means passing directory header data or whatever that
> it
> stores in if_data back to pick_link(), which is then further
> processed
> as a string.

This is the sort of feedback I was hoping for.

This sounds related to the life-cycle of xfs inodes and re-use.
Hopefully someone here on the list can enlighten me on this.

The thing that comes to mind is that the inode re-use would
need to occur between the VFS check that validates the inode
is still ok and the use of the link string. I think the link
can still go away even with the above check.

Hopefully someone can clarify what happens here.

> 
> With that, I wonder why we wouldn't just return -ECHILD here like we
> do
> for the non-inline case to address the immediate problem, and then
> perhaps separately consider if we can rework bits of the
> reuse/reclaim
> code to allow rcu lookup of inline symlinks under certain conditions.

Always switching to ref-walk mode would certainly resolve the
problem too, yes, perhaps we have no choice ...

Ian
> 
> FWIW, I'm also a little curious why we don't set i_link for inline
> symlinks. I don't think that addresses this validation problem, but
> perhaps might allow rcu lookups in the inline symlink common case
> where
> things don't change during the lookup (and maybe even eliminate the
> need
> for this custom inline callback)..?
> 
> Brian
> 
> >         if (XFS_IS_CORRUPT(ip->i_mount, !link))
> >                 return ERR_PTR(-EFSCORRUPTED);
> > +
> >         return link;
> >  }
> >  
> > 
> > 
>
Dave Chinner Nov. 12, 2021, 12:32 a.m. UTC | #3
On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote:
> When following a trailing symlink in rcu-walk mode it's possible to
> succeed in getting the ->get_link() method pointer but the link path
> string be deallocated while it's being used.
> 
> Utilize the rcu mechanism to mitigate this risk.
> 
> Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
> Signed-off-by: Ian Kent <raven@themaw.net>
> ---
>  fs/xfs/kmem.h      |    4 ++++
>  fs/xfs/xfs_inode.c |    4 ++--
>  fs/xfs/xfs_iops.c  |   10 ++++++++--
>  3 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 54da6d717a06..c1bd1103b340 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -61,6 +61,10 @@ static inline void  kmem_free(const void *ptr)
>  {
>  	kvfree(ptr);
>  }
> +static inline void  kmem_free_rcu(const void *ptr)
> +{
> +	kvfree_rcu(ptr);
> +}
>  
>  
>  static inline void *
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index a4f6f034fb81..aaa1911e61ed 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2650,8 +2650,8 @@ xfs_ifree(
>  	 * already been freed by xfs_attr_inactive.
>  	 */
>  	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> -		kmem_free(ip->i_df.if_u1.if_data);
> -		ip->i_df.if_u1.if_data = NULL;
> +		kmem_free_rcu(ip->i_df.if_u1.if_data);
> +		RCU_INIT_POINTER(ip->i_df.if_u1.if_data, NULL);
>  		ip->i_df.if_bytes = 0;
>  	}

How do we get here in a way that the VFS will walk into this inode
during a lookup?

I mean, the dentry has to be validated and held during the RCU path
walk, so if we are running a transaction to mark the inode as free
here it has already been unlinked and the dentry turned
negative. So anything that is doing a lockless pathwalk onto that
dentry *should* see that it is a negative dentry at this point and
hence nothing should be walking any further or trying to access the
link that was shared from ->get_link().

AFAICT, that's what the sequence check bug you fixed in the previous
patch guarantees. It makes no difference if the unlinked inode has
been recycled or not, the lookup race condition is the same in that
the inode has gone through ->destroy_inode and is now owned by the
filesystem and not the VFS.

Otherwise, it might just be best to memset the buffer to zero here
rather than free it, and leave it to be freed when the inode is
freed from the RCU callback in xfs_inode_free_callback() as per
normal.
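
Illustratively, that alternative would be something like the following
sketch in xfs_ifree() (untested, using the names from the patch above):

        if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
                /*
                 * Clear the link text but keep the allocation, so a
                 * concurrent lockless walker sees an empty,
                 * NUL-terminated string. The buffer is then freed with
                 * the inode from the RCU callback in
                 * xfs_inode_free_callback() as per normal.
                 */
                memset(ip->i_df.if_u1.if_data, 0, ip->i_df.if_bytes);
        }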

Cheers,

Dave.
Ian Kent Nov. 12, 2021, 1:26 a.m. UTC | #4
On Fri, 2021-11-12 at 11:32 +1100, Dave Chinner wrote:
> On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote:
> > When following a trailing symlink in rcu-walk mode it's possible to
> > succeed in getting the ->get_link() method pointer but the link
> > path
> > string be deallocated while it's being used.
> > 
> > Utilize the rcu mechanism to mitigate this risk.
> > 
> > Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
> > Signed-off-by: Ian Kent <raven@themaw.net>
> > ---
> >  fs/xfs/kmem.h      |    4 ++++
> >  fs/xfs/xfs_inode.c |    4 ++--
> >  fs/xfs/xfs_iops.c  |   10 ++++++++--
> >  3 files changed, 14 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > index 54da6d717a06..c1bd1103b340 100644
> > --- a/fs/xfs/kmem.h
> > +++ b/fs/xfs/kmem.h
> > @@ -61,6 +61,10 @@ static inline void  kmem_free(const void *ptr)
> >  {
> >         kvfree(ptr);
> >  }
> > +static inline void  kmem_free_rcu(const void *ptr)
> > +{
> > +       kvfree_rcu(ptr);
> > +}
> >  
> >  
> >  static inline void *
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index a4f6f034fb81..aaa1911e61ed 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2650,8 +2650,8 @@ xfs_ifree(
> >          * already been freed by xfs_attr_inactive.
> >          */
> >         if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > -               kmem_free(ip->i_df.if_u1.if_data);
> > -               ip->i_df.if_u1.if_data = NULL;
> > +               kmem_free_rcu(ip->i_df.if_u1.if_data);
> > +               RCU_INIT_POINTER(ip->i_df.if_u1.if_data, NULL);
> >                 ip->i_df.if_bytes = 0;
> >         }
> 
> How do we get here in a way that the VFS will walk into this inode
> during a lookup?
> 
> I mean, the dentry has to be validated and held during the RCU path
> walk, so if we are running a transaction to mark the inode as free
> here it has already been unlinked and the dentry turned
> negative. So anything that is doing a lockless pathwalk onto that
> dentry *should* see that it is a negative dentry at this point and
> hence nothing should be walking any further or trying to access the
> link that was shared from ->get_link().
> 
> AFAICT, that's what the sequence check bug you fixed in the previous
> patch guarantees. It makes no difference if the unlinked inode has
> been recycled or not, the lookup race condition is the same in that
> the inode has gone through ->destroy_inode and is now owned by the
> filesystem and not the VFS.

That's right.

The concern is that the process that's doing the release is different
so ->destroy_inode() can be called at "any" time during an rcu-mode
walk (since it isn't holding references). Like just after the sequence
check and ->get_link() pointer read racing with a concurrent unlink
of the symlink.

The race window must be very small indeed but I thought it was
possible.

You're right, the first question to answer is whether this is in fact
needed at all.

Ian
> 
> Otherwise, it might just be best to memset the buffer to zero here
> rather than free it, and leave it to be freed when the inode is
> freed from the RCU callback in xfs_inode_free_callback() as per
> normal.
> 
> Cheers,
> 
> Dave.
Miklos Szeredi Nov. 12, 2021, 7:23 a.m. UTC | #5
On Fri, 12 Nov 2021 at 01:32, Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote:
> > When following a trailing symlink in rcu-walk mode it's possible to
> > succeed in getting the ->get_link() method pointer but the link path
> > string be deallocated while it's being used.
> >
> > Utilize the rcu mechanism to mitigate this risk.
> >
> > Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
> > Signed-off-by: Ian Kent <raven@themaw.net>
> > ---
> >  fs/xfs/kmem.h      |    4 ++++
> >  fs/xfs/xfs_inode.c |    4 ++--
> >  fs/xfs/xfs_iops.c  |   10 ++++++++--
> >  3 files changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > index 54da6d717a06..c1bd1103b340 100644
> > --- a/fs/xfs/kmem.h
> > +++ b/fs/xfs/kmem.h
> > @@ -61,6 +61,10 @@ static inline void  kmem_free(const void *ptr)
> >  {
> >       kvfree(ptr);
> >  }
> > +static inline void  kmem_free_rcu(const void *ptr)
> > +{
> > +     kvfree_rcu(ptr);
> > +}
> >
> >
> >  static inline void *
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index a4f6f034fb81..aaa1911e61ed 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2650,8 +2650,8 @@ xfs_ifree(
> >        * already been freed by xfs_attr_inactive.
> >        */
> >       if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > -             kmem_free(ip->i_df.if_u1.if_data);
> > -             ip->i_df.if_u1.if_data = NULL;
> > +             kmem_free_rcu(ip->i_df.if_u1.if_data);
> > +             RCU_INIT_POINTER(ip->i_df.if_u1.if_data, NULL);
> >               ip->i_df.if_bytes = 0;
> >       }
>
> How do we get here in a way that the VFS will walk into this inode
> during a lookup?
>
> I mean, the dentry has to be validated and held during the RCU path
> walk, so if we are running a transaction to mark the inode as free
> here it has already been unlinked and the dentry turned
> negative. So anything that is doing a lockless pathwalk onto that
> dentry *should* see that it is a negative dentry at this point and
> hence nothing should be walking any further or trying to access the
> link that was shared from ->get_link().
>
> AFAICT, that's what the sequence check bug you fixed in the previous
> patch guarantees. It makes no difference if the unlinked inode has
> been recycled or not, the lookup race condition is the same in that
> the inode has gone through ->destroy_inode and is now owned by the
> filesystem and not the VFS.

Yes, the concern here is that without locking all the above can
theoretically happen between the sequence number check and if_data
being dereferenced.

> Otherwise, it might just be best to memset the buffer to zero here
> rather than free it, and leave it to be freed when the inode is
> freed from the RCU callback in xfs_inode_free_callback() as per
> normal.

My suggestion was to use .free_inode instead of .destroy_inode, the
former always being called after an RCU grace period.
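
As a rough sketch of that suggestion (hypothetical shape only, not a
proposed patch): ->free_inode() is invoked from the VFS's RCU callback,
so teardown done there cannot race with an rcu-walk.

        static void
        xfs_fs_free_inode(
                struct inode            *inode)
        {
                struct xfs_inode        *ip = XFS_I(inode);

                /* Runs only after an RCU grace period has expired. */
                if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL)
                        kmem_free(ip->i_df.if_u1.if_data);
        }

        static const struct super_operations xfs_super_operations = {
                /* ... existing operations ... */
                .free_inode             = xfs_fs_free_inode,
        };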

Thanks,
Miklos
Brian Foster Nov. 12, 2021, 11:47 a.m. UTC | #6
On Fri, Nov 12, 2021 at 07:10:19AM +0800, Ian Kent wrote:
> On Thu, 2021-11-11 at 11:08 -0500, Brian Foster wrote:
> 
> Hi Brian,
> 
> > On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote:
> > > When following a trailing symlink in rcu-walk mode it's possible to
> > > succeed in getting the ->get_link() method pointer but the link
> > > path
> > > string be deallocated while it's being used.
> > > 
> > > Utilize the rcu mechanism to mitigate this risk.
> > > 
> > > Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
> > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > ---
> > >  fs/xfs/kmem.h      |    4 ++++
> > >  fs/xfs/xfs_inode.c |    4 ++--
> > >  fs/xfs/xfs_iops.c  |   10 ++++++++--
> > >  3 files changed, 14 insertions(+), 4 deletions(-)
> > > 
> > ...
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index a607d6aca5c4..2977e19da7b7 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -524,11 +524,17 @@ xfs_vn_get_link_inline(
> > >  
> > >         /*
> > >          * The VFS crashes on a NULL pointer, so return -
> > > EFSCORRUPTED if
> > > -        * if_data is junk.
> > > +        * if_data is junk. Also, if the path walk is in rcu-walk
> > > mode
> > > +        * and the inode link path has gone away due inode re-use
> > > we have
> > > +        * no choice but to tell the VFS to redo the lookup.
> > >          */
> > > -       link = ip->i_df.if_u1.if_data;
> > > +       link = rcu_dereference(ip->i_df.if_u1.if_data);
> > > +       if (!dentry && !link)
> > > +               return ERR_PTR(-ECHILD);
> > > +
> > 
> > One thing that concerns me slightly about this approach is that inode
> > reuse does not necessarily guarantee that if_data is NULL. It seems
> > technically just as possible (even if exceedingly unlikely) for link
> > to
> > point at newly allocated memory since the previous sequence count
> > validation check. The inode could be reused as another inline symlink
> > for example, though it's not clear to me if that is really a problem
> > for
> > the vfs (assuming a restart would just land on the new link
> > anyways?).
> > But the inode could also be reallocated as something like a shortform
> > directory, which means passing directory header data or whatever that
> > it
> > stores in if_data back to pick_link(), which is then further
> > processed
> > as a string.
> 
> This is the sort of feedback I was hoping for.
> 
> This sounds related to the life-cycle of xfs inodes and re-use.
> Hopefully someone here on the list can enlighten me on this.
> 
> The thing that comes to mind is that the inode re-use would
> need to occur between the VFS check that validates the inode
> is still ok and the use of link string. I think that can still
> go away even with the above check.
> 

Yeah... The original NULL ->get_link() problem was replicated with a
small delay in the lookup path (specifically in the symlink processing
path). This essentially widens the race window and allows a separate
task to invalidate the dentry between the time the last dentry sequence
validation occurred (and passed) and the attempt to call ->get_link()
becomes imminent. I think patch 1 largely addresses this issue because
we'll have revalidated the previous read of the function pointer before
we attempt to call it.

That leads to this patch, which suggests that even after the validation
fix a small race window still technically exists with the ->get_link()
code and inode teardown. In fact, it's not that hard to show that this
is true by modifying the original reproducer to push the delay out
beyond the check added by patch 1 (or into the ->get_link() callback).
Playing around with that a bit, it's possible to induce a ->get_link()
call to an inode that was reallocated as a shortform directory, which
returns the non-NULL if_data fork of that dir back to the vfs (to be
interpreted as a symlink string). Nothing seems to explode on my quick
test, fortunately, but I don't think that's an interaction we want to
maintain.
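
Illustratively, the kind of debug-only delay injection being described
might look like this inside xfs_vn_get_link_inline() (hypothetical, not
the actual reproducer; mdelay() needs <linux/delay.h>):

        link = ip->i_df.if_u1.if_data;
        if (XFS_IS_CORRUPT(ip->i_mount, !link))
                return ERR_PTR(-EFSCORRUPTED);

        /*
         * Debug only: widen the race window so a concurrent unlink and
         * inode reuse can land before the VFS consumes the string.
         * mdelay() busy-waits, which is legal under rcu_read_lock().
         */
        mdelay(100);

        return link;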

Of course one caveat to all of that is that after patch 1, the race
window for that one might be so small as to make this impossible to
reproduce in practice (whereas the problem fixed by patch 1 has been
reproduced by users)...

> Hopefully someone can clarify what happens here.
> 
> > 
> > With that, I wonder why we wouldn't just return -ECHILD here like we
> > do
> > for the non-inline case to address the immediate problem, and then
> > perhaps separately consider if we can rework bits of the
> > reuse/reclaim
> > code to allow rcu lookup of inline symlinks under certain conditions.
> 
> Always switching to ref-walk mode would certainly resolve the
> problem too, yes, perhaps we have no choice ...
> 

Oh I don't think it's the only choice. I think Miklos' suggestion to use
->free_inode() is probably the right general approach. I just think a
switch to ref-walk mode might be a good incremental step to fix this
problem in a backportable way (s_op->free_inode() is newer relative to
the introduction of _get_link_inline()). We can always re-enable rcu
symlink processing once we get our inode teardown/reuse bits fixed up
accordingly. Just my .02.

Brian

> Ian
> > 
> > FWIW, I'm also a little curious why we don't set i_link for inline
> > symlinks. I don't think that addresses this validation problem, but
> > perhaps might allow rcu lookups in the inline symlink common case
> > where
> > things don't change during the lookup (and maybe even eliminate the
> > need
> > for this custom inline callback)..?
> > 
> > Brian
> > 
> > >         if (XFS_IS_CORRUPT(ip->i_mount, !link))
> > >                 return ERR_PTR(-EFSCORRUPTED);
> > > +
> > >         return link;
> > >  }
> > >  
> > > 
> > > 
> > 
> 
>
Ian Kent Nov. 13, 2021, 5:17 a.m. UTC | #7
On Fri, 2021-11-12 at 06:47 -0500, Brian Foster wrote:
> On Fri, Nov 12, 2021 at 07:10:19AM +0800, Ian Kent wrote:
> > On Thu, 2021-11-11 at 11:08 -0500, Brian Foster wrote:
> > 
> > Hi Brian,
> > 
> > > On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote:
> > > > When following a trailing symlink in rcu-walk mode it's
> > > > possible to
> > > > succeed in getting the ->get_link() method pointer but the link
> > > > path
> > > > string be deallocated while it's being used.
> > > > 
> > > > Utilize the rcu mechanism to mitigate this risk.
> > > > 
> > > > Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
> > > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > > ---
> > > >  fs/xfs/kmem.h      |    4 ++++
> > > >  fs/xfs/xfs_inode.c |    4 ++--
> > > >  fs/xfs/xfs_iops.c  |   10 ++++++++--
> > > >  3 files changed, 14 insertions(+), 4 deletions(-)
> > > > 
> > > ...
> > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > index a607d6aca5c4..2977e19da7b7 100644
> > > > --- a/fs/xfs/xfs_iops.c
> > > > +++ b/fs/xfs/xfs_iops.c
> > > > @@ -524,11 +524,17 @@ xfs_vn_get_link_inline(
> > > >  
> > > >         /*
> > > >          * The VFS crashes on a NULL pointer, so return -
> > > > EFSCORRUPTED if
> > > > -        * if_data is junk.
> > > > +        * if_data is junk. Also, if the path walk is in rcu-
> > > > walk
> > > > mode
> > > > +        * and the inode link path has gone away due inode re-
> > > > use
> > > > we have
> > > > +        * no choice but to tell the VFS to redo the lookup.
> > > >          */
> > > > -       link = ip->i_df.if_u1.if_data;
> > > > +       link = rcu_dereference(ip->i_df.if_u1.if_data);
> > > > +       if (!dentry && !link)
> > > > +               return ERR_PTR(-ECHILD);
> > > > +
> > > 
> > > One thing that concerns me slightly about this approach is that
> > > inode
> > > reuse does not necessarily guarantee that if_data is NULL. It
> > > seems
> > > technically just as possible (even if exceedingly unlikely) for
> > > link
> > > to
> > > point at newly allocated memory since the previous sequence count
> > > validation check. The inode could be reused as another inline
> > > symlink
> > > for example, though it's not clear to me if that is really a
> > > problem
> > > for
> > > the vfs (assuming a restart would just land on the new link
> > > anyways?).
> > > But the inode could also be reallocated as something like a
> > > shortform
> > > directory, which means passing directory header data or whatever
> > > that
> > > it
> > > stores in if_data back to pick_link(), which is then further
> > > processed
> > > as a string.
> > 
> > This is the sort of feedback I was hoping for.
> > 
> > This sounds related to the life-cycle of xfs inodes and re-use.
> > Hopefully someone here on the list can enlighten me on this.
> > 
> > The thing that comes to mind is that the inode re-use would
> > need to occur between the VFS check that validates the inode
> > is still ok and the use of link string. I think that can still
> > go away even with the above check.
> > 
> 
> Yeah... The original NULL ->get_link() problem was replicated with a
> small delay in the lookup path (specifically in the symlink
> processing
> path). This essentially widens the race window and allows a separate
> task to invalidate the dentry between the time the last dentry
> sequence
> validation occurred (and passed) and the attempt to call ->get_link()
> becomes imminent. I think patch 1 largely addresses this issue
> because
> we'll have revalidated the previous read of the function pointer
> before
> we attempt to call it.
> 
> That leads to this patch, which suggests that even after the
> validation
> fix a small race window still technically exists with the -
> >get_link()
> code and inode teardown. In fact, it's not that hard to show that
> this
> is true by modifying the original reproducer to push the delay out
> beyond the check added by patch 1 (or into the ->get_link()
> callback).
> Playing around with that a bit, it's possible to induce a -
> >get_link()
> call to an inode that was reallocated as a shortform directory and
> returns a non-NULL if_data fork of that dir back to the vfs (to be
> interpreted as a symlink string). Nothing seems to explode on my
> quick
> test, fortunately, but I don't think that's an interaction we want to
> maintain.
> 
> Of course one caveat to all of that is that after patch 1, the race
> window for that one might be so small as to make this impossible to
> reproduce in practice (whereas the problem fixed by patch 1 has been
> reproduced by users)...
> 
> > Hopefully someone can clarify what happens here.
> > 
> > > 
> > > With that, I wonder why we wouldn't just return -ECHILD here like
> > > we
> > > do
> > > for the non-inline case to address the immediate problem, and
> > > then
> > > perhaps separately consider if we can rework bits of the
> > > reuse/reclaim
> > > code to allow rcu lookup of inline symlinks under certain
> > > conditions.
> > 
> > Always switching to ref-walk mode would certainly resolve the
> > problem too, yes, perhaps we have no choice ...
> > 
> 
> Oh I don't think it's the only choice. I think Miklos' suggestion to
> use
> ->free_inode() is probably the right general approach. I just think a
> switch to ref-walk mode might be a good incremental step to fix this
> problem in a backportable way (s_op->free_inode() is newer relative
> to
> the introduction of _get_link_inline()). We can always re-enable rcu
> symlink processing once we get our inode teardown/reuse bits fixed up
> accordingly.. Just my .02.

Yes, I've had a change of heart on this too.

I think returning -ECHILD from xfs_vn_get_link_inline() is the
best solution.

There are a couple of reasons for that, the main one being that the
link string can still go away while the VFS is using it, but
also Al has said more than once that switching to ref-walk mode
is not a big deal and that makes the problems vanish completely.
In any case references are taken at successful walk completion
anyway.

If staying in rcu-walk mode whenever possible turns out to be
worthwhile in cases like this then there's probably a lot more to
do to handle this properly. The lockless stuff is tricky and error-prone
(certainly it is for me) and side effects are almost always hiding
in unexpected places.

So as you say, that's something for another day.
I'll update the patch and post an update.

Ian 
> 
> Brian
> 
> > Ian
> > > 
> > > FWIW, I'm also a little curious why we don't set i_link for
> > > inline
> > > symlinks. I don't think that addresses this validation problem,
> > > but
> > > perhaps might allow rcu lookups in the inline symlink common case
> > > where
> > > things don't change during the lookup (and maybe even eliminate
> > > the
> > > need
> > > for this custom inline callback)..?
> > > 
> > > Brian
> > > 
> > > >         if (XFS_IS_CORRUPT(ip->i_mount, !link))
> > > >                 return ERR_PTR(-EFSCORRUPTED);
> > > > +
> > > >         return link;
> > > >  }
> > > >  
> > > > 
> > > > 
> > > 
> > 
> > 
>
Dave Chinner Nov. 14, 2021, 11:18 p.m. UTC | #8
On Fri, Nov 12, 2021 at 08:23:24AM +0100, Miklos Szeredi wrote:
> On Fri, 12 Nov 2021 at 01:32, Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote:
> > > When following a trailing symlink in rcu-walk mode it's possible to
> > > succeed in getting the ->get_link() method pointer but the link path
> > > string be deallocated while it's being used.
> > >
> > > Utilize the rcu mechanism to mitigate this risk.
> > >
> > > Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
> > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > ---
> > >  fs/xfs/kmem.h      |    4 ++++
> > >  fs/xfs/xfs_inode.c |    4 ++--
> > >  fs/xfs/xfs_iops.c  |   10 ++++++++--
> > >  3 files changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > > index 54da6d717a06..c1bd1103b340 100644
> > > --- a/fs/xfs/kmem.h
> > > +++ b/fs/xfs/kmem.h
> > > @@ -61,6 +61,10 @@ static inline void  kmem_free(const void *ptr)
> > >  {
> > >       kvfree(ptr);
> > >  }
> > > +static inline void  kmem_free_rcu(const void *ptr)
> > > +{
> > > +     kvfree_rcu(ptr);
> > > +}
> > >
> > >
> > >  static inline void *
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index a4f6f034fb81..aaa1911e61ed 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -2650,8 +2650,8 @@ xfs_ifree(
> > >        * already been freed by xfs_attr_inactive.
> > >        */
> > >       if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > -             kmem_free(ip->i_df.if_u1.if_data);
> > > -             ip->i_df.if_u1.if_data = NULL;
> > > +             kmem_free_rcu(ip->i_df.if_u1.if_data);
> > > +             RCU_INIT_POINTER(ip->i_df.if_u1.if_data, NULL);
> > >               ip->i_df.if_bytes = 0;
> > >       }
> >
> > How do we get here in a way that the VFS will walk into this inode
> > during a lookup?
> >
> > I mean, the dentry has to be validated and held during the RCU path
> > walk, so if we are running a transaction to mark the inode as free
> > here it has already been unlinked and the dentry turned
> > negative. So anything that is doing a lockless pathwalk onto that
> > dentry *should* see that it is a negative dentry at this point and
> > hence nothing should be walking any further or trying to access the
> > link that was shared from ->get_link().
> >
> > AFAICT, that's what the sequence check bug you fixed in the previous
> > patch guarantees. It makes no difference if the unlinked inode has
> > been recycled or not, the lookup race condition is the same in that
> > the inode has gone through ->destroy_inode and is now owned by the
> > filesystem and not the VFS.
> 
> Yes, the concern here is that without locking all the above can
> theoretically happen between the sequence number check and  if_data
> being dereferenced.

It would be good to describe the race condition in the commit message,
because it's not at all obvious to readers how it is triggered.

> > Otherwise, it might just be best to memset the buffer to zero here
> > rather than free it, and leave it to be freed when the inode is
> > freed from the RCU callback in xfs_inode_free_callback() as per
> > normal.

Just as an FYI, ext4_evict_inode() does exactly this with the inline
symlink data buffer it passes to ->get_link(), when it is freeing the
inode as the last reference to an unlinked inode goes away:

        /*
         * Set inode->i_size to 0 before calling ext4_truncate(). We need
         * special handling of symlinks here because i_size is used to
         * determine whether ext4_inode_info->i_data contains symlink data or
         * block mappings. Setting i_size to 0 will remove its fast symlink
         * status. Erase i_data so that it becomes a valid empty block map.
         */
        if (ext4_inode_is_fast_symlink(inode))
                memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
	inode->i_size = 0;

IOWs, if the pointer returned from ->get_link() on an ext4
filesystem is accessed by the VFS after ->evict() is called, then it
sees an empty buffer. IOWs, it's just not safe for the VFS to access
the inode's link buffer pointer the moment the last reference to an
unlinked inode goes away because its contents are no longer
guaranteed to be valid.

I note that ext4 then immediately sets the VFS inode size to 0
length, indicating the link is no longer valid. It may well be that
XFS is not setting the VFS inode size to zero in this case (I don't
think the generic evict() path does that) so perhaps that's a guard
that could be used at the VFS level to avoid the race condition...
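
Sketching that thought, purely hypothetically, somewhere in the VFS
symlink pickup path:

        /*
         * Hypothetical VFS-level guard: an evicted fast symlink has had
         * i_size forced to zero, so treat a zero-length link seen during
         * a lockless walk as stale and fall back to ref-walk mode.
         */
        if (!READ_ONCE(inode->i_size))
                return ERR_PTR(-ECHILD);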

> My suggestion was to use .free_inode instead of .destroy_inode, the
> former always being called after an RCU grace period.

Which doesn't address the ext4 "zero the buffer in .evict" approach,
either.

And for XFS, it means we then likely need to add synchronize_rcu()
calls wherever we need to wait for inodes to be inactivated and/or
reclaimed because they won't even get queued for inactivation until
the current grace period expires. That's a potential locking
nightmare because inode inactivation and reclaim uses rcu protected
lockless algorithms to access XFS inodes without taking reference
counts....

I just can't see how this race condition is XFS specific and why
fixing it requires XFS to specifically handle it while we ignore
similar theoretical issues in other filesystems...

Cheers,

Dave.
Miklos Szeredi Nov. 15, 2021, 9:21 a.m. UTC | #9
On Mon, 15 Nov 2021 at 00:18, Dave Chinner <david@fromorbit.com> wrote:
> I just can't see how this race condition is XFS specific and why
> fixing it requires XFS to sepcifically handle it while we ignore
> similar theoretical issues in other filesystems...

It is XFS specific, because all other filesystems RCU free the in-core
inode after eviction.

XFS is the only one that reuses the in-core inode object and that is
very much different from anything the other filesystems do and what
the VFS expects.
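
For reference, a simplified paraphrase of the default VFS teardown in
fs/inode.c::destroy_inode(), showing the behaviour being described
(details elided):

        if (ops->destroy_inode) {
                ops->destroy_inode(inode);
                if (!ops->free_inode)
                        return; /* fs frees the inode itself, as XFS does */
        }
        inode->free_inode = ops->free_inode;
        /* default: free the in-core inode only after an RCU grace period */
        call_rcu(&inode->i_rcu, i_callback);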

I don't see how clearing the quick link buffer in ext4_evict_inode()
could do anything bad.  The contents are irrelevant, the lookup will
be restarted anyway, the important thing is that the buffer is not
freed and that it's null terminated, and both hold for ext4,
AFAICS.

I tend to agree with Brian and Ian at this point: return -ECHILD from
xfs_vn_get_link_inline() until xfs's inode reuse vs. rcu walk
implications are fully dealt with.  No way to fix this from VFS alone.

Thanks,
Miklos
Dave Chinner Nov. 15, 2021, 10:24 p.m. UTC | #10
On Mon, Nov 15, 2021 at 10:21:03AM +0100, Miklos Szeredi wrote:
> On Mon, 15 Nov 2021 at 00:18, Dave Chinner <david@fromorbit.com> wrote:
> > I just can't see how this race condition is XFS specific and why
> > fixing it requires XFS to sepcifically handle it while we ignore
> > similar theoretical issues in other filesystems...
> 
> It is XFS specific, because all other filesystems RCU free the in-core
> inode after eviction.
> 
> XFS is the only one that reuses the in-core inode object and that is
> very much different from anything the other filesystems do and what
> the VFS expects.

Sure, but I was referring to the xfs_ifree issue that the patch
addressed, not the re-use issue that the *first patch addressed*.

> I don't see how clearing the quick link buffer in ext4_evict_inode()
> could do anything bad.  The contents are irrelevant, the lookup will
> be restarted anyway, the important thing is that the buffer is not
> freed and that it's null terminated, and both hold for the ext4,
> AFAICS.

You miss the point (which, admittedly, probably wasn't clear).

I suggested just zeroing the buffer in xfs_ifree instead of freeing
it, which you seemed to suggest wouldn't work and we should move the
XFS functionality to .free_inode. That's what I was referring to as
"not being XFS specific" - if it is safe for ext4 to zero the link
buffer in .evict while lockless lookups can still be accessing the
link buffer, it is safe for XFS to do the same thing in .destroy
context.

If it isn't safe for ext4 to do that, then we have a general
pathwalk problem, not an XFS issue. But, as you say, it is safe to
do this zeroing, so the fix to xfs_ifree() is to zero the link
buffer instead of freeing it, just like ext4 does.

As a side issue, we really don't want to move what XFS does in
.destroy_inode to .free_inode because that then means we need to add
synchronize_rcu() calls everywhere in XFS that might need to wait on
inodes being inactivated and/or reclaimed. And because inode reclaim
uses lockless rcu lookups, there's substantial danger of adding rcu
callback related deadlocks to XFS here. That's just not a direction
we should be moving in.

I'll also point out that this would require XFS inodes to pass
through *two* rcu grace periods before the memory they hold could be
freed because, as I mentioned, xfs inode reclaim uses rcu protected
inode lookups and so relies on inodes to be freed by rcu callback...

> I tend to agree with Brian and Ian at this point: return -ECHILD from
> xfs_vn_get_link_inline() until xfs's inode resue vs. rcu walk
> implications are fully dealt with.  No way to fix this from VFS alone.

I disagree from a fundamental process POV - this is just sweeping
the issue under the table and leaving it for someone else to solve
because the root cause of the inode re-use issue has not been
identified. But to the person who architected the lockless XFS inode
cache 15 years ago, it's pretty obvious, so let's just solve it now.

With the xfs_ifree() problem solved by zeroing rather than freeing,
then the only other problem is inode reuse *within an rcu grace
period*. Immediate inode reuse tends to be rare (we can actually
trace occurrences to validate this assertion), and implementation-wise
reuse is isolated to a single function: xfs_iget_recycle().

xfs_iget_recycle() drops the rcu_read_lock() inode lookup context
that found the inode, marks it as being reclaimed (preventing other
lookups from finding it), then re-initialises the inode. This is
what makes .get_link change in the middle of pathwalk - we're
reinitialising the inode without waiting for the RCU grace period to
expire.

The obvious thing to do here is that after we drop the RCU read
context, we simply call synchronize_rcu() before we start
re-initialising the inode to wait for the current grace period to
expire. This ensures that any pathwalk that may have found that
inode has seen the sequence number change and dropped out of
lockless mode and is no longer trying to access that inode.  Then we
can safely reinitialise the inode as it has passed through a RCU
grace period just like it would have if it was freed and
reallocated.
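
A sketch of where that wait would sit (hypothetical placement, not a
final patch):

        static int
        xfs_iget_recycle(
                struct xfs_perag        *pag,
                struct xfs_inode        *ip)
        {
                ...
                ip->i_flags |= XFS_IRECLAIM;    /* hide from other lookups */
                spin_unlock(&ip->i_flags_lock);
                rcu_read_unlock();

                /*
                 * Wait out the current grace period so any lockless
                 * pathwalk that found this inode has seen the sequence
                 * count change and dropped out of rcu-walk mode before
                 * the inode is reinitialised for its new identity.
                 */
                synchronize_rcu();
                ...
        }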

This completely removes the entire class of "reused inodes race with
VFS level RCU walks" bugs from the XFS inode cache implementation,
hence XFS inodes behave the same as all other filesystems w.r.t RCU
grace period expiry needing to occur before a VFS inode is reused.

So, it looks like three patches to fix this entirely:

1. the pathwalk link sequence check fix
2. zeroing the inline link buffer in xfs_ifree()
3. adding synchronize_rcu() (or some variant) to xfs_iget_recycle()

Cheers,

Dave.
Ian Kent Nov. 16, 2021, 1:03 a.m. UTC | #11
On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> On Mon, Nov 15, 2021 at 10:21:03AM +0100, Miklos Szeredi wrote:
> > On Mon, 15 Nov 2021 at 00:18, Dave Chinner <david@fromorbit.com>
> > wrote:
> > > I just can't see how this race condition is XFS specific and why
> > > fixing it requires XFS to sepcifically handle it while we ignore
> > > similar theoretical issues in other filesystems...
> > 
> > It is XFS specific, because all other filesystems RCU free the in-
> > core
> > inode after eviction.
> > 
> > XFS is the only one that reuses the in-core inode object and that
> > is
> > very much different from anything the other filesystems do and what
> > the VFS expects.
> 
> Sure, but I was refering to the xfs_ifree issue that the patch
> addressed, not the re-use issue that the *first patch addressed*.
> 
> > I don't see how clearing the quick link buffer in
> > ext4_evict_inode()
> > could do anything bad.  The contents are irrelevant, the lookup
> > will
> > be restarted anyway, the important thing is that the buffer is not
> > freed and that it's null terminated, and both hold for the ext4,
> > AFAICS.
> 
> You miss the point (which, admittedly, probably wasn't clear).
> 
> I suggested just zeroing the buffer in xfs_ifree instead of zeroing
> it, which you seemed to suggest wouldn't work and we should move the
> XFS functionality to .free_inode. That's what I was refering to as
> "not being XFS specific" - if it is safe for ext4 to zero the link
> buffer in .evict while lockless lookups can still be accessing the
> link buffer, it is safe for XFS to do the same thing in .destroy
> context.

I'll need to think about that for a while.

Zeroing the buffer while it's being used seems like a problem to
me and was what this patch was trying to avoid.

I thought all that would be needed for this to happen is for a
dentry drop to occur while the link walk was happening after
->get_link() had returned the pointer.

What have I got wrong in that thinking?

> 
> If it isn't safe for ext4 to do that, then we have a general
> pathwalk problem, not an XFS issue. But, as you say, it is safe to
> do this zeroing, so the fix to xfs_ifree() is to zero the link
> buffer instead of freeing it, just like ext4 does.
> 
> As a side issue, we really don't want to move what XFS does in
> .destroy_inode to .free_inode because that then means we need to add
> synchronise_rcu() calls everywhere in XFS that might need to wait on
> inodes being inactivated and/or reclaimed. And because inode reclaim
> uses lockless rcu lookups, there's substantial danger of adding rcu
> callback related deadlocks to XFS here. That's just not a direction
> we should be moving in.

Another reason I decided to use the ECHILD return instead is that
I thought synchronize_rcu() might add an unexpected delay.

Since synchronize_rcu() will only wait for processes that currently
hold the rcu read lock, do you think that could actually be a problem
in this code path?

> 
> I'll also point out that this would require XFS inodes to pass
> through *two* rcu grace periods before the memory they hold could be
> freed because, as I mentioned, xfs inode reclaim uses rcu protected
> inode lookups and so relies on inodes to be freed by rcu callback...
> 
> > I tend to agree with Brian and Ian at this point: return -ECHILD
> > from
> > xfs_vn_get_link_inline() until xfs's inode resue vs. rcu walk
> > implications are fully dealt with.  No way to fix this from VFS
> > alone.
> 
> I disagree from a fundamental process POV - this is just sweeping
> the issue under the table and leaving it for someone else to solve
> because the root cause of the inode re-use issue has not been
> identified. But to the person who architected the lockless XFS inode
> cache 15 years ago, it's pretty obvious, so let's just solve it now.

Sorry, I don't understand what you mean by the root cause not
being identified?

Until lockless path walking was introduced this wasn't a problem
because references were held during walks so there could never be
a final dput() to trigger the freeing process during a walk. And a lot
of code probably still makes that assumption. And code that does
make that assumption should return -ECHILD in cases like this so
that the VFS can either legitimize the struct path (by taking
references) or restart the walk in ref-walk mode.

Can you elaborate please?

> 
> With the xfs_ifree() problem solved by zeroing rather than freeing,
> then the only other problem is inode reuse *within an rcu grace
> period*. Immediate inode reuse tends to be rare, (we can actually
> trace occurrences to validate this assertion), and implementation
> wise reuse is isolated to a single function: xfs_iget_recycle().
> 
> xfs_iget_recycle() drops the rcu_read_lock() inode lookup context
> that found the inode marks it as being reclaimed (preventing other
> lookups from finding it), then re-initialises the inode. This is
> what makes .get_link change in the middle of pathwalk - we're
> reinitialising the inode without waiting for the RCU grace period to
> expire.

Ok, good to know that, there's a lot of icache code to look
through ;)

At this point I come back to thinking the original patch might
be sufficient. But then that's only for xfs and excludes
potential problems with other file systems so I'll not go
there.

> 
> The obvious thing to do here is that after we drop the RCU read
> context, we simply call synchronize_rcu() before we start
> re-initialising the inode to wait for the current grace period to
> expire. This ensures that any pathwalk that may have found that
> inode has seen the sequence number change and droppped out of
> lockless mode and is no longer trying to access that inode.  Then we
> can safely reinitialise the inode as it has passed through a RCU
> grace period just like it would have if it was freed and
> reallocated.

Sounds right to me, as long as it is ok to call synchronize_rcu()
here.

> 
> This completely removes the entire class of "reused inodes race with
> VFS level RCU walks" bugs from the XFS inode cache implementation,
> hence XFS inodes behave the same as all other filesystems w.r.t RCU
> grace period expiry needing to occur before a VFS inode is reused.
> 
> So, it looks like three patches to fix this entirely:
> 
> 1. the pathwalk link sequence check fix
> 2. zeroing the inline link buffer in xfs_ifree()

I'm sorry but I'm really having trouble understanding how this is
ok. If some process is using the buffer to walk a link path, how
can zeroing the contents of the buffer be ok?

> 3. adding synchronize_rcu() (or some variant) to xfs_iget_recycle()
> 
> Cheers,
> 
> Dave.
Dave Chinner Nov. 16, 2021, 3:01 a.m. UTC | #12
On Tue, Nov 16, 2021 at 09:03:31AM +0800, Ian Kent wrote:
> On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> > On Mon, Nov 15, 2021 at 10:21:03AM +0100, Miklos Szeredi wrote:
> > > On Mon, 15 Nov 2021 at 00:18, Dave Chinner <david@fromorbit.com>
> > > wrote:
> > > > I just can't see how this race condition is XFS specific and why
> > > > fixing it requires XFS to sepcifically handle it while we ignore
> > > > similar theoretical issues in other filesystems...
> > > 
> > > It is XFS specific, because all other filesystems RCU free the in-
> > > core
> > > inode after eviction.
> > > 
> > > XFS is the only one that reuses the in-core inode object and that
> > > is
> > > very much different from anything the other filesystems do and what
> > > the VFS expects.
> > 
> > Sure, but I was refering to the xfs_ifree issue that the patch
> > addressed, not the re-use issue that the *first patch addressed*.
> > 
> > > I don't see how clearing the quick link buffer in
> > > ext4_evict_inode()
> > > could do anything bad.  The contents are irrelevant, the lookup
> > > will
> > > be restarted anyway, the important thing is that the buffer is not
> > > freed and that it's null terminated, and both hold for the ext4,
> > > AFAICS.
> > 
> > You miss the point (which, admittedly, probably wasn't clear).
> > 
> > I suggested just zeroing the buffer in xfs_ifree instead of zeroing
> > it, which you seemed to suggest wouldn't work and we should move the
> > XFS functionality to .free_inode. That's what I was refering to as
> > "not being XFS specific" - if it is safe for ext4 to zero the link
> > buffer in .evict while lockless lookups can still be accessing the
> > link buffer, it is safe for XFS to do the same thing in .destroy
> > context.
> 
> I'll need to think about that for a while.
> 
> Zeroing the buffer while it's being used seems like a problem to
> me and was what this patch was trying to avoid.

*nod*

That was my reading of the situation when I saw what ext4 was doing.
But Miklos says that this is fine, and I don't know the code well
enough to say he's wrong. So if it's ok for ext4, it's OK for XFS.
If it's not OK for XFS, then it isn't OK for ext4 either, and we
have more bugs to fix than just in XFS.

> I thought all that would be needed for this to happen is for a
> dentry drop to occur while the link walk was happening after
> ->get_link() had returned the pointer.
> 
> What have I got wrong in that thinking?

Nothing that I can see, but see my previous statement above.

I *think* that just zeroing the buffer means the link resolves as
either wholly intact, partially zeroed with trailing zeros in the
length, wholly zeroed, or zero length.
Nothing will crash, the link string is always null terminated even
if the length is wrong, and so nothing bad should happen as a result
of zeroing the symlink buffer when it gets evicted from the VFS
inode cache after unlink.

> > If it isn't safe for ext4 to do that, then we have a general
> > pathwalk problem, not an XFS issue. But, as you say, it is safe
> > to do this zeroing, so the fix to xfs_ifree() is to zero the
> > link buffer instead of freeing it, just like ext4 does.
> > 
> > As a side issue, we really don't want to move what XFS does in
> > .destroy_inode to .free_inode because that then means we need to
> > add synchronise_rcu() calls everywhere in XFS that might need to
> > wait on inodes being inactivated and/or reclaimed. And because
> > inode reclaim uses lockless rcu lookups, there's substantial
> > danger of adding rcu callback related deadlocks to XFS here.
> > That's just not a direction we should be moving in.
> 
> Another reason I decided to use the ECHILD return instead is that
> I thought synchronise_rcu() might add an unexpected delay.

It depends where you put the synchronize_rcu() call. :)

> Since synchronise_rcu() will only wait for processes that
> currently have the rcu read lock do you think that could actually
> be a problem in this code path?

No, I don't think it will.  The inode recycle case in XFS inode
lookup can trigger in two cases:

1. VFS cache eviction followed by immediate lookup
2. Inode has been unlinked and evicted, then freed and reallocated by
the filesystem.

In case #1, that's a cold cache lookup and hence delays are
acceptable (e.g. a slightly longer delay might result in having to
fetch the inode from disk again). Calling synchronize_rcu() in this
case is not going to be any different from having to fetch the inode
from disk...

In case #2, there's a *lot* of CPU work being done to modify
metadata (inode btree updates, etc), and so the operations can block
on journal space, metadata IO, etc. Delays are acceptable, and could
be in the order of hundreds of milliseconds if the transaction
subsystem is bottlenecked. Waiting for an RCU grace period when we
reallocate an inode immediately after freeing it isn't a big deal.

IOWs, if synchronize_rcu() turns out to be a problem, we can
optimise that separately - we need to correct the inode reuse
behaviour w.r.t. VFS RCU expectations, then we can optimise the
result if there are perf problems stemming from correct behaviour.

> > I'll also point out that this would require XFS inodes to pass
> > through *two* rcu grace periods before the memory they hold could be
> > freed because, as I mentioned, xfs inode reclaim uses rcu protected
> > inode lookups and so relies on inodes to be freed by rcu callback...
> > 
> > > I tend to agree with Brian and Ian at this point: return -ECHILD
> > > from
> > > xfs_vn_get_link_inline() until xfs's inode resue vs. rcu walk
> > > implications are fully dealt with.  No way to fix this from VFS
> > > alone.
> > 
> > I disagree from a fundamental process POV - this is just sweeping
> > the issue under the table and leaving it for someone else to solve
> > because the root cause of the inode re-use issue has not been
> > identified. But to the person who architected the lockless XFS inode
> > cache 15 years ago, it's pretty obvious, so let's just solve it now.
> 
> Sorry, I don't understand what you mean by the root cause not
> being identified?

The whole approach of "we don't know how to fix the inode reuse case
so disable it" implies that nobody has understood where in the reuse
case the problem lies. i.e. "inode reuse" by itself is not the root
cause of the problem.

The root cause is "allowing an inode to be reused without waiting
for an RCU grace period to expire". This might seem pedantic, but
"without waiting for an rcu grace period to expire" is the important
part of the problem (i.e. the bug), not the "allowing an inode to be
reused" bit.

Once the RCU part of the problem is pointed out, the solution
becomes obvious. As nobody had seen the obvious fix (wait for an RCU
grace period when recycling an inode) it stands to reason that
nobody really understood what the root cause of the inode reuse
problem was.

> > With the xfs_ifree() problem solved by zeroing rather than freeing,
> > then the only other problem is inode reuse *within an rcu grace
> > period*. Immediate inode reuse tends to be rare, (we can actually
> > trace occurrences to validate this assertion), and implementation
> > wise reuse is isolated to a single function: xfs_iget_recycle().
> > 
> > xfs_iget_recycle() drops the rcu_read_lock() inode lookup context
> > that found the inode marks it as being reclaimed (preventing other
> > lookups from finding it), then re-initialises the inode. This is
> > what makes .get_link change in the middle of pathwalk - we're
> > reinitialising the inode without waiting for the RCU grace period to
> > expire.
> 
> Ok, good to know that, there's a lot of icache code to look
> through, ;)

My point precisely. :)

Cheers,

Dave.
Miklos Szeredi Nov. 16, 2021, 10:12 a.m. UTC | #13
On Tue, 16 Nov 2021 at 04:01, Dave Chinner <david@fromorbit.com> wrote:

> I *think* that just zeroing the buffer means the race condition
> means the link resolves as either wholly intact, partially zeroed
> with trailing zeros in the length, wholly zeroed or zero length.
> Nothing will crash, the link string is always null terminated even
> if the length is wrong, and so nothing bad should happen as a result
> of zeroing the symlink buffer when it gets evicted from the VFS
> inode cache after unlink.

That's my thinking.  However, modifying the buffer while it is being
processed does seem pretty ugly, and I have to admit that I don't
understand why this needs to be done in either XFS or EXT4.

> The root cause is "allowing an inode to be reused without waiting
> for an RCU grace period to expire". This might seem pedantic, but
> "without waiting for an rcu grace period to expire" is the important
> part of the problem (i.e. the bug), not the "allowing an inode to be
> reused" bit.

Yes.

Thanks,
Miklos
Ian Kent Nov. 16, 2021, 1:38 p.m. UTC | #14
On Tue, 2021-11-16 at 14:01 +1100, Dave Chinner wrote:
> On Tue, Nov 16, 2021 at 09:03:31AM +0800, Ian Kent wrote:
> > On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> > > On Mon, Nov 15, 2021 at 10:21:03AM +0100, Miklos Szeredi wrote:
> > > > On Mon, 15 Nov 2021 at 00:18, Dave Chinner
> > > > <david@fromorbit.com>
> > > > wrote:
> > > > > I just can't see how this race condition is XFS specific and
> > > > > why
> > > > > fixing it requires XFS to sepcifically handle it while we
> > > > > ignore
> > > > > similar theoretical issues in other filesystems...
> > > > 
> > > > It is XFS specific, because all other filesystems RCU free the
> > > > in-
> > > > core
> > > > inode after eviction.
> > > > 
> > > > XFS is the only one that reuses the in-core inode object and
> > > > that
> > > > is
> > > > very much different from anything the other filesystems do and
> > > > what
> > > > the VFS expects.
> > > 
> > > Sure, but I was refering to the xfs_ifree issue that the patch
> > > addressed, not the re-use issue that the *first patch addressed*.
> > > 
> > > > I don't see how clearing the quick link buffer in ext4_evict_inode()
> > > > could do anything bad.  The contents are irrelevant, the lookup will
> > > > be restarted anyway, the important thing is that the buffer is not
> > > > freed and that it's null terminated, and both hold for ext4,
> > > > AFAICS.
> > > 
> > > You miss the point (which, admittedly, probably wasn't clear).
> > > 
> > > I suggested just zeroing the buffer in xfs_ifree instead of freeing
> > > it, which you seemed to suggest wouldn't work and we should move the
> > > XFS functionality to .free_inode. That's what I was referring to as
> > > "not being XFS specific" - if it is safe for ext4 to zero the link
> > > buffer in .evict while lockless lookups can still be accessing the
> > > link buffer, it is safe for XFS to do the same thing in .destroy
> > > context.
> > 
> > I'll need to think about that for a while.
> > 
> > Zeroing the buffer while it's being used seems like a problem to
> > me and was what this patch was trying to avoid.
> 
> *nod*
> 
> That was my reading of the situation when I saw what ext4 was doing.
> But Miklos says that this is fine, and I don't know the code well
> enough to say he's wrong. So if it's ok for ext4, it's OK for XFS.
> If it's not OK for XFS, then it isn't OK for ext4 either, and we
> have more bugs to fix than just in XFS.
> 
> > I thought all that would be needed for this to happen is for a
> > dentry drop to occur while the link walk was happening after
> > ->get_link() had returned the pointer.
> > 
> > What have I got wrong in that thinking?
> 
> Nothing that I can see, but see my previous statement above.
> 
> I *think* that just zeroing the buffer means that, under the race
> condition, the link resolves as either wholly intact, partially zeroed
> with trailing zeros within the length, wholly zeroed, or zero length.
> Nothing will crash, the link string is always null terminated even
> if the length is wrong, and so nothing bad should happen as a result
> of zeroing the symlink buffer when it gets evicted from the VFS
> inode cache after unlink.

Oh, of course (sound of penny dropping), the walk will loop around
and an empty link string will essentially end the walk. What's
needed then is to look at what would be returned in that case.

So there shouldn't be a crash then, and assuming a sensible walk
failure in this case, ENOENT (dentry now negative) or similar is
most likely (need to check that).
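
For reference, the zeroing variant we're discussing would look
roughly like this in the local-format branch of xfs_ifree() - a
minimal sketch reusing the field names already used there, not a
tested change:

	/*
	 * Sketch: zero the inline symlink buffer rather than freeing it.
	 * A concurrent rcu-walk that has already fetched the if_data
	 * pointer then sees a valid, NUL-terminated (now empty) string
	 * instead of freed memory, and the walk fails cleanly. The
	 * buffer itself is released later by normal inode teardown.
	 */
	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL &&
	    ip->i_df.if_u1.if_data)
		memset(ip->i_df.if_u1.if_data, 0, ip->i_df.if_bytes);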

> 
> > > If it isn't safe for ext4 to do that, then we have a general
> > > pathwalk problem, not an XFS issue. But, as you say, it is safe
> > > to do this zeroing, so the fix to xfs_ifree() is to zero the
> > > link buffer instead of freeing it, just like ext4 does.
> > > 
> > > As a side issue, we really don't want to move what XFS does in
> > > .destroy_inode to .free_inode because that then means we need to
> > > add synchronize_rcu() calls everywhere in XFS that might need to
> > > wait on inodes being inactivated and/or reclaimed. And because
> > > inode reclaim uses lockless rcu lookups, there's substantial
> > > danger of adding rcu callback related deadlocks to XFS here.
> > > That's just not a direction we should be moving in.
> > 
> > Another reason I decided to use the ECHILD return instead is that
> > I thought synchronize_rcu() might add an unexpected delay.
> 
> It depends where you put the synchronize_rcu() call. :)
> 
> > Since synchronize_rcu() will only wait for processes that
> > currently have the rcu read lock, do you think that could actually
> > be a problem in this code path?
> 
> No, I don't think it will.  The inode recycle case in XFS inode
> lookup can trigger in two cases:
> 
> 1. VFS cache eviction followed by immediate lookup
> 2. Inode has been unlinked and evicted, then freed and reallocated by
> the filesystem.
> 
> In case #1, that's a cold cache lookup and hence delays are
> acceptable (e.g. a slightly longer delay might result in having to
> fetch the inode from disk again). Calling synchronize_rcu() in this
> case is not going to be any different from having to fetch the inode
> from disk...
> 
> In case #2, there's a *lot* of CPU work being done to modify
> metadata (inode btree updates, etc), and so the operations can block
> on journal space, metadata IO, etc. Delays are acceptable, and could
> be in the order of hundreds of milliseconds if the transaction
> subsystem is bottlenecked. Waiting for an RCU grace period when we
> reallocate an inode immediately after freeing it isn't a big deal.
> 
> IOWs, if synchronize_rcu() turns out to be a problem, we can
> optimise that separately - we need to correct the inode reuse
> behaviour w.r.t. VFS RCU expectations, then we can optimise the
> result if there are perf problems stemming from correct behaviour.

Sounds good, so a synchronize_rcu() in that particular location
would allow some time to fail the walk before the inode is re-used.
That should be quick enough to avoid any possible re-use races ...

Interesting ... 

OTOH ext4 is not a problem because the inode is going away, not
being re-used, so there's no potential race from filling in the
inode fields afresh.

I think that's the concern Miklos is alluding to.

> 
> > > I'll also point out that this would require XFS inodes to pass
> > > through *two* rcu grace periods before the memory they hold could
> > > be freed because, as I mentioned, xfs inode reclaim uses rcu
> > > protected inode lookups and so relies on inodes to be freed by rcu
> > > callback...
> > > 
> > > > I tend to agree with Brian and Ian at this point: return -ECHILD
> > > > from xfs_vn_get_link_inline() until xfs's inode reuse vs. rcu walk
> > > > implications are fully dealt with.  No way to fix this from VFS
> > > > alone.
> > > 
> > > I disagree from a fundamental process POV - this is just sweeping
> > > the issue under the table and leaving it for someone else to solve
> > > because the root cause of the inode re-use issue has not been
> > > identified. But to the person who architected the lockless XFS
> > > inode cache 15 years ago, it's pretty obvious, so let's just solve
> > > it now.
> > 
> > Sorry, I don't understand what you mean by the root cause not
> > being identified?
> 
> The whole approach of "we don't know how to fix the inode reuse case
> so disable it" implies that nobody has understood where in the reuse
> case the problem lies. i.e. "inode reuse" by itself is not the root
> cause of the problem.

Right, not strictly no.

> 
> The root cause is "allowing an inode to be reused without waiting
> for an RCU grace period to expire". This might seem pedantic, but
> "without waiting for an rcu grace period to expire" is the important
> part of the problem (i.e. the bug), not the "allowing an inode to be
> reused" bit.

Pedantic is good, it's needed in this case for sure.

Provided handling of the dentry (and indirectly the inode) is
done quickly, and zeroing the field should ensure just that.
Trying to preserve the old link path string isn't feasible: it
could take too long to resolve the path, and possibly switch path
walk modes, introducing side effects related to the RCU grace
period expiring. But the truth is the link is gone, so failing
the walk should be a perfectly valid result.

> 
> Once the RCU part of the problem is pointed out, the solution
> becomes obvious. As nobody had seen the obvious (wait for an RCU
> grace period when recycling an inode), it stands to reason that
> nobody really understood what the root cause of the inode reuse
> problem was.

Well, I guess, not completely, yes ...

I'll think about what's been discussed and wait for any further
contributions before doing anything else on this. In any case
there are a few things to look at resulting from the discussion.

Thanks for your patience with this,
Ian

> 
> > > With the xfs_ifree() problem solved by zeroing rather than freeing,
> > > then the only other problem is inode reuse *within an rcu grace
> > > period*. Immediate inode reuse tends to be rare (we can actually
> > > trace occurrences to validate this assertion), and implementation-wise
> > > reuse is isolated to a single function: xfs_iget_recycle().
> > > 
> > > xfs_iget_recycle() drops the rcu_read_lock() inode lookup context
> > > that found the inode, marks it as being reclaimed (preventing other
> > > lookups from finding it), then re-initialises the inode. This is
> > > what makes .get_link change in the middle of pathwalk - we're
> > > reinitialising the inode without waiting for the RCU grace period
> > > to expire.
> > 
> > Ok, good to know that, there's a lot of icache code to look
> > through, ;)
> 
> My point precisely. :)
> 
> Cheers,
> 
> Dave.
Brian Foster Nov. 16, 2021, 3:59 p.m. UTC | #15
On Tue, Nov 16, 2021 at 02:01:20PM +1100, Dave Chinner wrote:
> On Tue, Nov 16, 2021 at 09:03:31AM +0800, Ian Kent wrote:
> > On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> > > On Mon, Nov 15, 2021 at 10:21:03AM +0100, Miklos Szeredi wrote:
> > > > On Mon, 15 Nov 2021 at 00:18, Dave Chinner <david@fromorbit.com>
> > > > wrote:
> > > > > I just can't see how this race condition is XFS specific and why
> > > > > fixing it requires XFS to specifically handle it while we ignore
> > > > > similar theoretical issues in other filesystems...
> > > > 
> > > > It is XFS specific, because all other filesystems RCU free the
> > > > in-core inode after eviction.
> > > > 
> > > > XFS is the only one that reuses the in-core inode object and that
> > > > is very much different from anything the other filesystems do and
> > > > what the VFS expects.
> > > 
> > > Sure, but I was referring to the xfs_ifree issue that the patch
> > > addressed, not the re-use issue that the *first patch addressed*.
> > > 
> > > > I don't see how clearing the quick link buffer in ext4_evict_inode()
> > > > could do anything bad.  The contents are irrelevant, the lookup will
> > > > be restarted anyway, the important thing is that the buffer is not
> > > > freed and that it's null terminated, and both hold for ext4,
> > > > AFAICS.
> > > 
> > > You miss the point (which, admittedly, probably wasn't clear).
> > > 
> > > I suggested just zeroing the buffer in xfs_ifree instead of freeing
> > > it, which you seemed to suggest wouldn't work and we should move the
> > > XFS functionality to .free_inode. That's what I was referring to as
> > > "not being XFS specific" - if it is safe for ext4 to zero the link
> > > buffer in .evict while lockless lookups can still be accessing the
> > > link buffer, it is safe for XFS to do the same thing in .destroy
> > > context.
> > 
> > I'll need to think about that for a while.
> > 
> > Zeroing the buffer while it's being used seems like a problem to
> > me and was what this patch was trying to avoid.
> 
> *nod*
> 
> That was my reading of the situation when I saw what ext4 was doing.
> But Miklos says that this is fine, and I don't know the code well
> enough to say he's wrong. So if it's ok for ext4, it's OK for XFS.
> If it's not OK for XFS, then it isn't OK for ext4 either, and we
> have more bugs to fix than just in XFS.
> 
> > I thought all that would be needed for this to happen is for a
> > dentry drop to occur while the link walk was happening after
> > ->get_link() had returned the pointer.
> > 
> > What have I got wrong in that thinking?
> 
> Nothing that I can see, but see my previous statement above.
> 
> I *think* that just zeroing the buffer means that, under the race
> condition, the link resolves as either wholly intact, partially zeroed
> with trailing zeros within the length, wholly zeroed, or zero length.
> Nothing will crash, the link string is always null terminated even
> if the length is wrong, and so nothing bad should happen as a result
> of zeroing the symlink buffer when it gets evicted from the VFS
> inode cache after unlink.
> 
> > > If it isn't safe for ext4 to do that, then we have a general
> > > pathwalk problem, not an XFS issue. But, as you say, it is safe
> > > to do this zeroing, so the fix to xfs_ifree() is to zero the
> > > link buffer instead of freeing it, just like ext4 does.
> > > 
> > > As a side issue, we really don't want to move what XFS does in
> > > .destroy_inode to .free_inode because that then means we need to
> > > add synchronize_rcu() calls everywhere in XFS that might need to
> > > wait on inodes being inactivated and/or reclaimed. And because
> > > inode reclaim uses lockless rcu lookups, there's substantial
> > > danger of adding rcu callback related deadlocks to XFS here.
> > > That's just not a direction we should be moving in.
> > 
> > Another reason I decided to use the ECHILD return instead is that
> > I thought synchronize_rcu() might add an unexpected delay.
> 
> It depends where you put the synchronize_rcu() call. :)
> 
> > Since synchronize_rcu() will only wait for processes that
> > currently have the rcu read lock, do you think that could actually
> > be a problem in this code path?
> 
> No, I don't think it will.  The inode recycle case in XFS inode
> lookup can trigger in two cases:
> 
> 1. VFS cache eviction followed by immediate lookup
> 2. Inode has been unlinked and evicted, then freed and reallocated by
> the filesystem.
> 
> In case #1, that's a cold cache lookup and hence delays are
> acceptable (e.g. a slightly longer delay might result in having to
> fetch the inode from disk again). Calling synchronize_rcu() in this
> case is not going to be any different from having to fetch the inode
> from disk...
> 
> In case #2, there's a *lot* of CPU work being done to modify
> metadata (inode btree updates, etc), and so the operations can block
> on journal space, metadata IO, etc. Delays are acceptable, and could
> be in the order of hundreds of milliseconds if the transaction
> subsystem is bottlenecked. Waiting for an RCU grace period when we
> reallocate an inode immediately after freeing it isn't a big deal.
> 
> IOWs, if synchronize_rcu() turns out to be a problem, we can
> optimise that separately - we need to correct the inode reuse
> behaviour w.r.t. VFS RCU expectations, then we can optimise the
> result if there are perf problems stemming from correct behaviour.
> 

FWIW, with a fairly crude test on a high cpu count system, it's not that
difficult to reproduce an observable degradation in inode allocation
rate with a synchronous grace period in the inode reuse path, caused
purely by a lookup heavy workload on a completely separate filesystem.
The following is a 5m snapshot of the iget stats from a filesystem doing
allocs/frees with an external/heavy lookup workload (which is not included
in the stats), with and without a sync grace period wait in the reuse
path:

baseline:	ig 1337026 1331541 4 5485 0 5541 1337026
sync_rcu_test:	ig 2955 2588 0 367 0 383 2955

I think this is kind of the nature of RCU and why I'm not sure it's a
great idea to rely on update side synchronization in a codepath that
might want to scale/perform in certain workloads. I'm not totally sure
if this will be a problem for real users running real workloads or not,
or if this can be easily mitigated, whether it's all rcu or a cascading
effect, etc. This is just a quick test, so all of that probably
requires more testing and analysis to discern.

> > > I'll also point out that this would require XFS inodes to pass
> > > through *two* rcu grace periods before the memory they hold could be
> > > freed because, as I mentioned, xfs inode reclaim uses rcu protected
> > > inode lookups and so relies on inodes to be freed by rcu callback...
> > > 
> > > > I tend to agree with Brian and Ian at this point: return -ECHILD
> > > > from xfs_vn_get_link_inline() until xfs's inode reuse vs. rcu walk
> > > > implications are fully dealt with.  No way to fix this from VFS
> > > > alone.
> > > 
> > > I disagree from a fundamental process POV - this is just sweeping
> > > the issue under the table and leaving it for someone else to solve
> > > because the root cause of the inode re-use issue has not been
> > > identified. But to the person who architected the lockless XFS inode
> > > cache 15 years ago, it's pretty obvious, so let's just solve it now.
> > 
> > Sorry, I don't understand what you mean by the root cause not
> > being identified?
> 
> The whole approach of "we don't know how to fix the inode reuse case
> so disable it" implies that nobody has understood where in the reuse
> case the problem lies. i.e. "inode reuse" by itself is not the root
> cause of the problem.
> 

I don't think anybody suggested disabling inode reuse. My suggestion
was to disable rcu walk mode on symlinks as an incremental step because
the change to enable it appeared to be an undocumented side effect of an
unrelated optimization. There was no real mention of it in the commit
log for the get_link_inline() variant, no analysis that explains if or
why it's safe to enable, and it was the historical behavior until this
change in the get_link() API to expose rcu walk was introduced.
fairly reasonable to me to put that logic back in place first (while
also providing a predictable/stable fix) before we get into the weeds of
doing the right things with rcu to re-enable it (whether that be
synchronize_rcu() or something else)...

> The root cause is "allowing an inode to be reused without waiting
> for an RCU grace period to expire". This might seem pedantic, but
> "without waiting for an rcu grace period to expire" is the important
> part of the problem (i.e. the bug), not the "allowing an inode to be
> reused" bit.
> 
> Once the RCU part of the problem is pointed out, the solution
> becomes obvious. As nobody had seen the obvious (wait for an RCU
> grace period when recycling an inode), it stands to reason that
> nobody really understood what the root cause of the inode reuse
> problem was.
> 

The synchronize_rcu() approach was one of the first options discussed in
the bug report once a reproducer was available. It had been tested as a
potential option for the broader problem (should the vfs change turn out
problematic) before these patches landed on the list. It's a reasonable
option and reasonable to prefer it over the most recent patch, but as
noted above I think there are other factors at play beyond having a pure
enough understanding of the root cause or not.

AIUI, this is not currently a reproducible problem even before patch 1,
which reduces the race window even further. Given that and the nak on
the current patch (the justification for which I don't really
understand), I'm starting to agree with Ian's earlier statement that
perhaps it is best to separate this one so we can (hopefully) move patch
1 along on its own merit..

Brian

> > > With the xfs_ifree() problem solved by zeroing rather than freeing,
> > > then the only other problem is inode reuse *within an rcu grace
> > > period*. Immediate inode reuse tends to be rare (we can actually
> > > trace occurrences to validate this assertion), and implementation-wise
> > > reuse is isolated to a single function: xfs_iget_recycle().
> > > 
> > > xfs_iget_recycle() drops the rcu_read_lock() inode lookup context
> > > that found the inode, marks it as being reclaimed (preventing other
> > > lookups from finding it), then re-initialises the inode. This is
> > > what makes .get_link change in the middle of pathwalk - we're
> > > reinitialising the inode without waiting for the RCU grace period to
> > > expire.
> > 
> > Ok, good to know that, there's a lot of icache code to look
> > through, ;)
> 
> My point precisely. :)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Brian Foster Nov. 16, 2021, 4:04 p.m. UTC | #16
On Tue, Nov 16, 2021 at 11:12:13AM +0100, Miklos Szeredi wrote:
> On Tue, 16 Nov 2021 at 04:01, Dave Chinner <david@fromorbit.com> wrote:
> 
> > I *think* that just zeroing the buffer means that, under the race
> > condition, the link resolves as either wholly intact, partially zeroed
> > with trailing zeros within the length, wholly zeroed, or zero length.
> > Nothing will crash, the link string is always null terminated even
> > if the length is wrong, and so nothing bad should happen as a result
> > of zeroing the symlink buffer when it gets evicted from the VFS
> > inode cache after unlink.
> 
> That's my thinking.  However, modifying the buffer while it is being
> processed does seem pretty ugly, and I have to admit that I don't
> understand why this needs to be done in either XFS or EXT4.
> 

Agreed. I'm also not following what problem this is intended to solve..?

Hmm.. it looks to me that the ext4 code zeroes the symlink to
accommodate its own truncate/teardown code because it will access the
field via a structure to interpret it as an (empty?) data mapping. IOW,
it doesn't seem to have anything to do with the vfs or path
walks/lookups but rather is an internal implementation detail of ext4.
It would probably be best if somebody who knows ext4 better could
comment on that before we take anything from it. Of course, there is the
fact that ext4 doing this seemingly doesn't disturb/explode the vfs, but
really neither does the current XFS code so it's kind of hard to say
whether one approach is any more or less correct purely based on the
fact that the code exists.

Brian

> > The root cause is "allowing an inode to be reused without waiting
> > for an RCU grace period to expire". This might seem pedantic, but
> > "without waiting for an rcu grace period to expire" is the important
> > part of the problem (i.e. the bug), not the "allowing an inode to be
> > reused" bit.
> 
> Yes.
> 
> Thanks,
> Miklos
>
Dave Chinner Nov. 17, 2021, 12:22 a.m. UTC | #17
On Tue, Nov 16, 2021 at 10:59:05AM -0500, Brian Foster wrote:
> On Tue, Nov 16, 2021 at 02:01:20PM +1100, Dave Chinner wrote:
> > On Tue, Nov 16, 2021 at 09:03:31AM +0800, Ian Kent wrote:
> > > On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> > > > If it isn't safe for ext4 to do that, then we have a general
> > > > pathwalk problem, not an XFS issue. But, as you say, it is safe
> > > > to do this zeroing, so the fix to xfs_ifree() is to zero the
> > > > link buffer instead of freeing it, just like ext4 does.
> > > > 
> > > > As a side issue, we really don't want to move what XFS does in
> > > > .destroy_inode to .free_inode because that then means we need to
> > > > add synchronize_rcu() calls everywhere in XFS that might need to
> > > > wait on inodes being inactivated and/or reclaimed. And because
> > > > inode reclaim uses lockless rcu lookups, there's substantial
> > > > danger of adding rcu callback related deadlocks to XFS here.
> > > > That's just not a direction we should be moving in.
> > > 
> > > Another reason I decided to use the ECHILD return instead is that
> > > I thought synchronize_rcu() might add an unexpected delay.
> > 
> > It depends where you put the synchronize_rcu() call. :)
> > 
> > > Since synchronize_rcu() will only wait for processes that
> > > currently have the rcu read lock, do you think that could actually
> > > be a problem in this code path?
> > 
> > No, I don't think it will.  The inode recycle case in XFS inode
> > lookup can trigger in two cases:
> > 
> > 1. VFS cache eviction followed by immediate lookup
> > 2. Inode has been unlinked and evicted, then freed and reallocated by
> > the filesystem.
> > 
> > In case #1, that's a cold cache lookup and hence delays are
> > acceptable (e.g. a slightly longer delay might result in having to
> > fetch the inode from disk again). Calling synchronize_rcu() in this
> > case is not going to be any different from having to fetch the inode
> > from disk...
> > 
> > In case #2, there's a *lot* of CPU work being done to modify
> > metadata (inode btree updates, etc), and so the operations can block
> > on journal space, metadata IO, etc. Delays are acceptable, and could
> > be in the order of hundreds of milliseconds if the transaction
> > subsystem is bottlenecked. Waiting for an RCU grace period when we
> > reallocate an inode immediately after freeing it isn't a big deal.
> > 
> > IOWs, if synchronize_rcu() turns out to be a problem, we can
> > optimise that separately - we need to correct the inode reuse
> > behaviour w.r.t. VFS RCU expectations, then we can optimise the
> > result if there are perf problems stemming from correct behaviour.
> > 
> 
> FWIW, with a fairly crude test on a high cpu count system, it's not that
> difficult to reproduce an observable degradation in inode allocation
> rate with a synchronous grace period in the inode reuse path, caused
> purely by a lookup heavy workload on a completely separate filesystem.
>
> The following is a 5m snapshot of the iget stats from a filesystem doing
> allocs/frees with an external/heavy lookup workload (which is not included
> in the stats), with and without a sync grace period wait in the reuse
> path:
> 
> baseline:	ig 1337026 1331541 4 5485 0 5541 1337026
> sync_rcu_test:	ig 2955 2588 0 367 0 383 2955

The alloc/free part of the workload is a single threaded
create/unlink in a tight loop, yes?

This smells like a side effect of aggressive reallocation of
just-freed XFS_IRECLAIMABLE inodes from the finobt that haven't had
their unlink state written back to disk yet. i.e. this is a corner
case in #2 above where a small set of inodes is being repeatedly
allocated and freed by userspace and hence being aggressively reused
and never needing to wait for IO. i.e. a tempfile workload
optimisation...

> I think this is kind of the nature of RCU and why I'm not sure it's a
> great idea to rely on update side synchronization in a codepath that
> might want to scale/perform in certain workloads.

The problem here is not update side synchronisation. Root cause is
aggressive reallocation of recently freed VFS inodes via physical
inode allocation algorithms. Unfortunately, the RCU grace period
requirements of the VFS inode life cycle dictate that we can't
aggressively re-allocate and reuse freed inodes like this. i.e.
reallocation of a just-freed inode also has to wait for an RCU grace
period to pass before the in memory inode can be re-instantiated as
a newly allocated inode.

(Hmmmm - I wonder if any of the other filesystems might have similar
problems with physical inode reallocation inside a RCU grace period?
i.e. without inode instance re-use, the VFS could potentially see
multiple in-memory instances of the same physical inode at the same
time.)

> I'm not totally sure
> if this will be a problem for real users running real workloads or not,
> or if this can be easily mitigated, whether it's all rcu or a cascading
> effect, etc. This is just a quick test, so all of that probably
> requires more testing and analysis to discern.

This looks like a similar problem to what busy extents address - we
can't reuse a newly freed extent until the transaction containing
the EFI/EFD hits stable storage (and the discard operation on the
range is complete). Hence while a newly freed extent is
marked free in the allocbt, it can't be reused until it is
released from the busy extent tree.

I can think of several ways to address this, but let me think on it
a bit more.  I suspect there's a trick we can use to avoid needing
synchronize_rcu() completely by using the spare radix tree tag and
rcu grace period state checks with get_state_synchronize_rcu() and
poll_state_synchronize_rcu() to clear the radix tree tags via a
periodic radix tree tag walk (i.e. allocation side polling for "can
we use this inode" rather than waiting for the grace period to
expire once an inode has been selected and allocated.)
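
Roughly the shape I have in mind - get_state_synchronize_rcu(),
poll_state_synchronize_rcu() and cond_synchronize_rcu() are the real
RCU APIs, while the i_destroy_gp field is hypothetical:

	/* at inode teardown: record the current grace period state */
	ip->i_destroy_gp = get_state_synchronize_rcu();

	/* at allocation, once a candidate inode has been selected: */
	if (!poll_state_synchronize_rcu(ip->i_destroy_gp)) {
		/*
		 * No grace period has elapsed since this inode was torn
		 * down: skip it and pick another, or fall back to
		 * cond_synchronize_rcu(ip->i_destroy_gp) to block only
		 * when strictly necessary.
		 */
		return -EAGAIN;
	}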

> > > 
> > > Sorry, I don't understand what you mean by the root cause not
> > > being identified?
> > 
> > The whole approach of "we don't know how to fix the inode reuse case
> > so disable it" implies that nobody has understood where in the reuse
> > case the problem lies. i.e. "inode reuse" by itself is not the root
> > cause of the problem.
> > 
> 
> I don't think anybody suggested disabling inode reuse.

Nobody did, so that's not what I was referring to. I was referring to
the patches for and comments advocating disabling .get_link for RCU
pathwalk because of the apparently unsolved problems stemming from
inode reuse...

> > The root cause is "allowing an inode to be reused without waiting
> > for an RCU grace period to expire". This might seem pedantic, but
> > "without waiting for an rcu grace period to expire" is the important
> > part of the problem (i.e. the bug), not the "allowing an inode to be
> > reused" bit.
> > 
> > Once the RCU part of the problem is pointed out, the solution
> > becomes obvious. As nobody had seen the obvious (wait for an RCU
> > grace period when recycling an inode), it stands to reason that
> > nobody really understood what the root cause of the inode reuse
> > problem was.
> > 
> 
> The synchronize_rcu() approach was one of the first options discussed in
> the bug report once a reproducer was available.

What bug report would that be? :/

It's not one that I've read, and I don't recall seeing a pointer to
it anywhere in the patch posting. IOWs, whatever discussion happened
in a private distro bug report can't be assumed as "general
knowledge" in an upstream discussion...

> AIUI, this is not currently a reproducible problem even before patch 1,
> which reduces the race window even further. Given that and the nak on
> the current patch (the justification for which I don't really
> understand), I'm starting to agree with Ian's earlier statement that
> perhaps it is best to separate this one so we can (hopefully) move patch
> 1 along on its own merit..

*nod*

The problem seems pretty rare, the pathwalk patch makes it
even rarer, so I think they can be separated just fine.

Cheers,

Dave.
Ian Kent Nov. 17, 2021, 2:19 a.m. UTC | #18
On Wed, 2021-11-17 at 11:22 +1100, Dave Chinner wrote:
> On Tue, Nov 16, 2021 at 10:59:05AM -0500, Brian Foster wrote:
> > On Tue, Nov 16, 2021 at 02:01:20PM +1100, Dave Chinner wrote:
> > > On Tue, Nov 16, 2021 at 09:03:31AM +0800, Ian Kent wrote:
> > > > On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> > > > > If it isn't safe for ext4 to do that, then we have a general
> > > > > pathwalk problem, not an XFS issue. But, as you say, it is
> > > > > safe
> > > > > to do this zeroing, so the fix to xfs_ifree() is to zero the
> > > > > link buffer instead of freeing it, just like ext4 does.
> > > > > 
> > > > > As a side issue, we really don't want to move what XFS does
> > > > > in
> > > > > .destroy_inode to .free_inode because that then means we need
> > > > > to
> > > > > add synchronize_rcu() calls everywhere in XFS that might need
> > > > > to
> > > > > wait on inodes being inactivated and/or reclaimed. And
> > > > > because
> > > > > inode reclaim uses lockless rcu lookups, there's substantial
> > > > > danger of adding rcu callback related deadlocks to XFS here.
> > > > > That's just not a direction we should be moving in.
> > > > 
> > > > Another reason I decided to use the ECHILD return instead is
> > > > that
> > > > I thought synchronize_rcu() might add an unexpected delay.
> > > 
> > > It depends where you put the synchronize_rcu() call. :)
> > > 
> > > > Since synchronize_rcu() will only wait for processes that
> > > > currently have the rcu read lock, do you think that could
> > > > actually be a problem in this code path?
> > > 
> > > No, I don't think it will.  The inode recycle case in XFS inode
> > > lookup can trigger in two cases:
> > > 
> > > 1. VFS cache eviction followed by immediate lookup
> > > 2. Inode has been unlinked and evicted, then freed and reallocated
> > > by the filesystem.
> > > 
> > > In case #1, that's a cold cache lookup and hence delays are
> > > acceptable (e.g. a slightly longer delay might result in having
> > > to fetch the inode from disk again). Calling synchronize_rcu() in
> > > this case is not going to be any different from having to fetch
> > > the inode from disk...
> > > 
> > > In case #2, there's a *lot* of CPU work being done to modify
> > > metadata (inode btree updates, etc), and so the operations can
> > > block on journal space, metadata IO, etc. Delays are acceptable,
> > > and could be in the order of hundreds of milliseconds if the
> > > transaction subsystem is bottlenecked. Waiting for an RCU grace
> > > period when we reallocate an inode immediately after freeing it
> > > isn't a big deal.
> > > 
> > > IOWs, if synchronize_rcu() turns out to be a problem, we can
> > > optimise that separately - we need to correct the inode reuse
> > > behaviour w.r.t. VFS RCU expectations, then we can optimise the
> > > result if there are perf problems stemming from correct
> > > behaviour.
> > > 
> > 
> > FWIW, with a fairly crude test on a high cpu count system, it's not
> > that
> > difficult to reproduce an observable degradation in inode
> > allocation
> > rate with a synchronous grace period in the inode reuse path,
> > caused
> > purely by a lookup heavy workload on a completely separate
> > filesystem.
> > 
> > The following is a 5m snapshot of the iget stats from a filesystem
> > doing
> > allocs/frees with an external/heavy lookup workload (which is not
> > included
> > in the stats), with and without a sync grace period wait in the
> > reuse
> > path:
> > 
> > baseline:       ig 1337026 1331541 4 5485 0 5541 1337026
> > sync_rcu_test:  ig 2955 2588 0 367 0 383 2955
> 
> The alloc/free part of the workload is a single threaded
> create/unlink in a tight loop, yes?
> 
> This smells like a side effect of aggressive reallocation of
> just-freed XFS_IRECLAIMABLE inodes from the finobt that haven't had
> their unlink state written back to disk yet. i.e. this is a corner
> case in #2 above where a small set of inodes is being repeatedly
> allocated and freed by userspace and hence being aggressively reused
> and never needing to wait for IO. i.e. a tempfile workload
> optimisation...
> 
> > I think this is kind of the nature of RCU and why I'm not sure it's
> > a
> > great idea to rely on update side synchronization in a codepath
> > that
> > might want to scale/perform in certain workloads.
> 
> The problem here is not update side synchronisation. Root cause is
> aggressive reallocation of recently freed VFS inodes via physical
> inode allocation algorithms. Unfortunately, the RCU grace period
> requirements of the VFS inode life cycle dictate that we can't
> aggressively re-allocate and reuse freed inodes like this. i.e.
> reallocation of a just-freed inode also has to wait for an RCU grace
> period to pass before the in memory inode can be re-instantiated as
> a newly allocated inode.
> 
> (Hmmmm - I wonder if any of the other filesystems might have similar
> problems with physical inode reallocation inside a RCU grace period?
> i.e. without inode instance re-use, the VFS could potentially see
> multiple in-memory instances of the same physical inode at the same
> time.)
> 
> > I'm not totally sure
> > if this will be a problem for real users running real workloads or
> > not, or if this can be easily mitigated, whether it's all rcu or a
> > cascading effect, etc. This is just a quick test, so all of that
> > probably requires more testing and analysis to discern.
> 
> This looks like a similar problem to what busy extents address - we
> can't reuse a newly freed extent until the transaction containing
> the EFI/EFD hits stable storage (and the discard operation on the
> range is complete). Hence while a newly freed extent is
> marked free in the allocbt, it can't be reused until it is
> released from the busy extent tree.
> 
> I can think of several ways to address this, but let me think on it
> a bit more.  I suspect there's a trick we can use to avoid needing
> synchronize_rcu() completely by using the spare radix tree tag and
> rcu grace period state checks with get_state_synchronize_rcu() and
> poll_state_synchronize_rcu() to clear the radix tree tags via a
> periodic radix tree tag walk (i.e. allocation side polling for "can
> we use this inode" rather than waiting for the grace period to
> expire once an inode has been selected and allocated.)

The synchronize_rcu() approach seems like too broad a brush.

It sounds like there are relatively simple ways to avoid the link
path race, which I won't go into again, but there's still a chance
inode re-use can cause confusion if done at the wrong time.

So it sounds like per-object (inode) granularity is needed for the
wait and that means answering the question "how do we know when it's
ok to re-use the inode" when we come to alloc the inode and want to
re-use one.

I don't know the answer to that question, but introducing an XFS flag
to indicate the inode is in transition (or altering the meaning of
an existing one) so we know to wait should be straightforward.

Perhaps the start of the rcu grace period for the object is a
suitable beginning, then we just need to know if the grace period
for the object has expired to complete the wait ... possibly via
an xfs-owned rcu callback on free to update the xfs flags ...

But I'm just thinking out loud here ...

There'd be a need to know when not to wait at all too ... mmm.
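
Something like the following is the shape I'm imagining - the flag,
the rcu_head field and the callback are all made-up names, purely to
illustrate the idea:

	/* callback runs once the grace period for this inode has expired */
	static void
	xfs_inode_rcu_done(
		struct rcu_head		*head)
	{
		struct xfs_inode	*ip =
			container_of(head, struct xfs_inode, i_rcu_trans);

		xfs_iflags_clear(ip, XFS_IRCU_TRANSITION);
	}

	/* at teardown: flag the inode as in transition, arm the callback */
	xfs_iflags_set(ip, XFS_IRCU_TRANSITION);
	call_rcu(&ip->i_rcu_trans, xfs_inode_rcu_done);

	/* at re-use: skip (or wait on) inodes still in transition */
	if (xfs_iflags_test(ip, XFS_IRCU_TRANSITION))
		return -EAGAIN;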

Ian
> 
> > > > 
> > > > Sorry, I don't understand what you mean by the root cause not
> > > > being identified?
> > > 
> > > The whole approach of "we don't know how to fix the inode reuse
> > > case
> > > so disable it" implies that nobody has understood where in the
> > > reuse
> > > case the problem lies. i.e. "inode reuse" by itself is not the
> > > root
> > > cause of the problem.
> > > 
> > 
> > I don't think anybody suggested disabling inode reuse.
> 
> Nobody did, so that's not what I was referring to. I was referring to
> the patches for and comments advocating disabling .get_link for RCU
> pathwalk because of the apparently unsolved problems stemming from
> inode reuse...
> 
> > > The root cause is "allowing an inode to be reused without waiting
> > > for an RCU grace period to expire". This might seem pedantic, but
> > > "without waiting for an rcu grace period to expire" is the
> > > important
> > > part of the problem (i.e. the bug), not the "allowing an inode to
> > > be
> > > reused" bit.
> > > 
> > > Once the RCU part of the problem is pointed out, the solution
> > > becomes obvious. As nobody had seen the obvious (wait for an RCU
> > > grace period when recycling an inode), it stands to reason that
> > > nobody really understood what the root cause of the inode reuse
> > > problem was.
> > > 
> > 
> > The synchronize_rcu() approach was one of the first options
> > discussed in
> > the bug report once a reproducer was available.
> 
> What bug report would that be? :/
> 
> It's not one that I've read, and I don't recall seeing a pointer to
> it anywhere in the patch posting. IOWs, whatever discussion happened
> in a private distro bug report can't be assumed as "general
> knowledge" in an upstream discussion...
> 
> > AIUI, this is not currently a reproducible problem even before
> > patch 1,
> > which reduces the race window even further. Given that and the nak
> > on
> > the current patch (the justification for which I don't really
> > understand), I'm starting to agree with Ian's earlier statement
> > that
> > perhaps it is best to separate this one so we can (hopefully) move
> > patch
> > 1 along on its own merit..
> 
> *nod*
> 
> The problem seems pretty rare, the pathwalk patch makes it
> even rarer, so I think they can be separated just fine.
> 
> Cheers,
> 
> Dave.
Brian Foster Nov. 17, 2021, 6:56 p.m. UTC | #19
On Wed, Nov 17, 2021 at 11:22:51AM +1100, Dave Chinner wrote:
> On Tue, Nov 16, 2021 at 10:59:05AM -0500, Brian Foster wrote:
> > On Tue, Nov 16, 2021 at 02:01:20PM +1100, Dave Chinner wrote:
> > > On Tue, Nov 16, 2021 at 09:03:31AM +0800, Ian Kent wrote:
> > > > On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> > > > > If it isn't safe for ext4 to do that, then we have a general
> > > > > pathwalk problem, not an XFS issue. But, as you say, it is safe
> > > > > to do this zeroing, so the fix to xfs_ifree() is to zero the
> > > > > link buffer instead of freeing it, just like ext4 does.
> > > > > 
> > > > > As a side issue, we really don't want to move what XFS does in
> > > > > .destroy_inode to .free_inode because that then means we need to
> > > > > add synchronize_rcu() calls everywhere in XFS that might need to
> > > > > wait on inodes being inactivated and/or reclaimed. And because
> > > > > inode reclaim uses lockless rcu lookups, there's substantial
> > > > > danger of adding rcu callback related deadlocks to XFS here.
> > > > > That's just not a direction we should be moving in.
> > > > 
> > > > Another reason I decided to use the ECHILD return instead is that
> > > > I thought synchronize_rcu() might add an unexpected delay.
> > > 
> > > It depends where you put the synchronize_rcu() call. :)
> > > 
> > > > Since synchronize_rcu() will only wait for processes that
> > > > currently have the rcu read lock, do you think that could actually
> > > > be a problem in this code path?
> > > 
> > > No, I don't think it will.  The inode recycle case in XFS inode
> > > lookup can trigger in two cases:
> > > 
> > > 1. VFS cache eviction followed by immediate lookup
> > > 2. Inode has been unlinked and evicted, then freed and reallocated by
> > > the filesystem.
> > > 
> > > In case #1, that's a cold cache lookup and hence delays are
> > > acceptable (e.g. a slightly longer delay might result in having to
> > > fetch the inode from disk again). Calling synchronize_rcu() in this
> > > case is not going to be any different from having to fetch the inode
> > > from disk...
> > > 
> > > In case #2, there's a *lot* of CPU work being done to modify
> > > metadata (inode btree updates, etc), and so the operations can block
> > > on journal space, metadata IO, etc. Delays are acceptable, and could
> > > be in the order of hundreds of milliseconds if the transaction
> > > subsystem is bottlenecked. Waiting for an RCU grace period when we
> > > reallocate an inode immediately after freeing it isn't a big deal.
> > > 
> > > IOWs, if synchronize_rcu() turns out to be a problem, we can
> > > optimise that separately - we need to correct the inode reuse
> > > behaviour w.r.t. VFS RCU expectations, then we can optimise the
> > > result if there are perf problems stemming from correct behaviour.
> > > 
> > 
> > FWIW, with a fairly crude test on a high cpu count system, it's not that
> > difficult to reproduce an observable degradation in inode allocation
> > rate with a synchronous grace period in the inode reuse path, caused
> > purely by a lookup heavy workload on a completely separate filesystem.
> >
> > The following is a 5m snapshot of the iget stats from a filesystem doing
> > allocs/frees with an external/heavy lookup workload (which is not included
> > in the stats), with and without a sync grace period wait in the reuse
> > path:
> > 
> > baseline:	ig 1337026 1331541 4 5485 0 5541 1337026
> > sync_rcu_test:	ig 2955 2588 0 367 0 383 2955
> 
> The alloc/free part of the workload is a single threaded
> create/unlink in a tight loop, yes?
> 
> This smells like a side effect of aggressive reallocation of
> just-freed XFS_IRECLAIMABLE inodes from the finobt that haven't had
> their unlink state written back to disk yet. i.e. this is a corner
> case in #2 above where a small set of inodes is being repeatedly
> allocated and freed by userspace and hence being aggressively reused
> and never needing to wait for IO. i.e. a tempfile workload
> optimisation...
> 

Yes, that was the point of the test: to stress inode reuse against
known rcu activity.

> > I think this is kind of the nature of RCU and why I'm not sure it's a
> > great idea to rely on update side synchronization in a codepath that
> > might want to scale/perform in certain workloads.
> 
> The problem here is not update side synchronisation. Root cause is
> aggressive reallocation of recently freed VFS inodes via physical
> inode allocation algorithms. Unfortunately, the RCU grace period
> requirements of the VFS inode life cycle dictate that we can't
> aggressively re-allocate and reuse freed inodes like this. i.e.
> reallocation of a just-freed inode also has to wait for an RCU grace
> period to pass before the in memory inode can be re-instantiated as
> a newly allocated inode.
> 

I'm just showing that insertion of a synchronous rcu grace period wait
in the iget codepath is not without side effect, because that was the
proposal.

> (Hmmmm - I wonder if any of the other filesystems might have similar
> problems with physical inode reallocation inside a RCU grace period?
> i.e. without inode instance re-use, the VFS could potentially see
> multiple in-memory instances of the same physical inode at the same
> time.)
> 
> > I'm not totally sure
> > if this will be a problem for real users running real workloads or not,
> > or if this can be easily mitigated, whether it's all rcu or a cascading
> > effect, etc. This is just a quick test, so all of that probably
> > requires more testing and analysis to discern.
> 
> This looks like a similar problem to what busy extents address - we
> can't reuse a newly freed extent until the transaction containing
> the EFI/EFD hits stable storage (and the discard operation on the
> range is complete). Hence while a newly freed extent is
> marked free in the allocbt, it can't be reused until it is
> released from the busy extent tree.
> 
> I can think of several ways to address this, but let me think on it
> a bit more.  I suspect there's a trick we can use to avoid needing
> synchronize_rcu() completely by using the spare radix tree tag and
> rcu grace period state checks with get_state_synchronize_rcu() and
> poll_state_synchronize_rcu() to clear the radix tree tags via a
> periodic radix tree tag walk (i.e. allocation side polling for "can
> we use this inode" rather than waiting for the grace period to
> expire once an inode has been selected and allocated.)
> 

Yeah, and same. It's just a matter of how to break things down. I can
sort of see where you're going with the above, though I'm not totally
convinced that rcu gp polling is an advantage over explicit use of
existing infrastructure/apis. It seems more important to avoid
overly crude things like sync waits in the alloc path than to optimize
away potentially multiple async grace periods in the free path. Of course,
it's worth thinking about options regardless.

That said, is deferred inactivation still a thing? If so, then we've
already decided to defer/batch inactivations from the point the vfs
calls our ->destroy_inode() based on our own heuristic (which is likely
longer than a grace period already in most cases, making this even less
of an issue). That includes deferral of the physical free and inobt
updates, which means inode reuse can't occur until the inactivation
workqueue task runs. Only a single grace period is required to cover
(from the rcuwalk perspective) the entire set of inodes queued for
inactivation. That leaves at least a few fairly straightforward options:

1. Use queue_rcu_work() to schedule the inactivation task. We'd probably
have to isolate the list to process first from the queueing context
rather than from workqueue context to ensure we don't process recently
added inodes that haven't sat for a grace period.

2. Drop a synchronize_rcu() in the workqueue task before it starts
processing items.

3. Incorporate something like the above with an rcu grace period cookie
to selectively process inodes (or batches thereof).

Options 1 and 2 both seem rather simple and unlikely to noticeably
impact behavior/performance. Option 3 seems a bit overkill to me, but is
certainly an option if the previous assertion around performance doesn't
hold true (particularly if we keep the tracking simple, such as
recording/enforcing the most recent gp of the set). Thoughts?
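
To sketch option 1 - queue_rcu_work(), INIT_RCU_WORK() and
to_rcu_work() are the real workqueue APIs, while the
xfs_inodegc/m_inodegc_wq names here are simplified from my reading of
the deferred inactivation code:

	/* simplified per-cpu inactivation queue with rcu_work swapped in */
	struct xfs_inodegc {
		struct llist_head	list;
		struct rcu_work		rwork;	/* was a plain work_struct */
	};

	/* worker side: recover the container from the embedded work item */
	static void
	xfs_inodegc_worker(
		struct work_struct	*work)
	{
		struct xfs_inodegc	*gc =
			container_of(to_rcu_work(work), struct xfs_inodegc,
				     rwork);

		/* process the previously isolated inodes on gc->list ... */
	}

	/* queueing side: the worker only runs after a grace period elapses */
	INIT_RCU_WORK(&gc->rwork, xfs_inodegc_worker);
	queue_rcu_work(mp->m_inodegc_wq, &gc->rwork);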

Brian

> > > > 
> > > > Sorry, I don't understand what you mean by the root cause not
> > > > being identified?
> > > 
> > > The whole approach of "we don't know how to fix the inode reuse case
> > > so disable it" implies that nobody has understood where in the reuse
> > > case the problem lies. i.e. "inode reuse" by itself is not the root
> > > cause of the problem.
> > > 
> > 
> > I don't think anybody suggested disabling inode reuse.
> 
> Nobody did, so that's not what I was referring to. I was referring to
> the patches for and comments advocating disabling .get_link for RCU
> pathwalk because of the apparently unsolved problems stemming from
> inode reuse...
> 
> > > The root cause is "allowing an inode to be reused without waiting
> > > for an RCU grace period to expire". This might seem pedantic, but
> > > "without waiting for an rcu grace period to expire" is the important
> > > part of the problem (i.e. the bug), not the "allowing an inode to be
> > > reused" bit.
> > > 
> > > Once the RCU part of the problem is pointed out, the solution
> > > becomes obvious. As nobody had seen the obvious (wait for an RCU
> > > grace period when recycling an inode), it stands to reason that
> > > nobody really understood what the root cause of the inode reuse
> > > problem was.
> > > 
> > 
> > The synchronize_rcu() approach was one of the first options discussed in
> > the bug report once a reproducer was available.
> 
> What bug report would that be? :/
> 
> It's not one that I've read, and I don't recall seeing a pointer to
> it anywhere in the patch posting. IOWs, whatever discussion happened
> in a private distro bug report can't be assumed as "general
> knowledge" in an upstream discussion...
> 
> > AIUI, this is not currently a reproducible problem even before patch 1,
> > which reduces the race window even further. Given that and the nak on
> > the current patch (the justification for which I don't really
> > understand), I'm starting to agree with Ian's earlier statement that
> > perhaps it is best to separate this one so we can (hopefully) move patch
> > 1 along on its own merit..
> 
> *nod*
> 
> The problem seems pretty rare, the pathwalk patch makes it
> even rarer, so I think they can be separated just fine.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
kernel test robot Nov. 17, 2021, 8:38 p.m. UTC | #20
Hi Ian,

I love your patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on mszeredi-vfs/overlayfs-next linus/master v5.16-rc1 next-20211117]
[cannot apply to djwong-xfs/djwong-devel]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Ian-Kent/vfs-check-dentry-is-still-valid-in-get_link/20211111-114013
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: i386-randconfig-s001-20211115 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/1208c3b6210fbb49718cdf4fa5f7db35bea008f6
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ian-Kent/vfs-check-dentry-is-still-valid-in-get_link/20211111-114013
        git checkout 1208c3b6210fbb49718cdf4fa5f7db35bea008f6
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=i386 SHELL=/bin/bash fs/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> fs/xfs/xfs_inode.c:2648:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
>> fs/xfs/xfs_inode.c:2648:17: sparse:    char [noderef] __rcu *
>> fs/xfs/xfs_inode.c:2648:17: sparse:    char *
--
>> fs/xfs/xfs_iops.c:531:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
>> fs/xfs/xfs_iops.c:531:16: sparse:    char [noderef] __rcu *
>> fs/xfs/xfs_iops.c:531:16: sparse:    char *

vim +2648 fs/xfs/xfs_inode.c

  2600	
  2601	/*
  2602	 * This is called to return an inode to the inode free list.
  2603	 * The inode should already be truncated to 0 length and have
  2604	 * no pages associated with it.  This routine also assumes that
  2605	 * the inode is already a part of the transaction.
  2606	 *
  2607	 * The on-disk copy of the inode will have been added to the list
  2608	 * of unlinked inodes in the AGI. We need to remove the inode from
  2609	 * that list atomically with respect to freeing it here.
  2610	 */
  2611	int
  2612	xfs_ifree(
  2613		struct xfs_trans	*tp,
  2614		struct xfs_inode	*ip)
  2615	{
  2616		struct xfs_mount	*mp = ip->i_mount;
  2617		struct xfs_perag	*pag;
  2618		struct xfs_icluster	xic = { 0 };
  2619		struct xfs_inode_log_item *iip = ip->i_itemp;
  2620		int			error;
  2621	
  2622		ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
  2623		ASSERT(VFS_I(ip)->i_nlink == 0);
  2624		ASSERT(ip->i_df.if_nextents == 0);
  2625		ASSERT(ip->i_disk_size == 0 || !S_ISREG(VFS_I(ip)->i_mode));
  2626		ASSERT(ip->i_nblocks == 0);
  2627	
  2628		pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
  2629	
  2630		/*
  2631		 * Pull the on-disk inode from the AGI unlinked list.
  2632		 */
  2633		error = xfs_iunlink_remove(tp, pag, ip);
  2634		if (error)
  2635			goto out;
  2636	
  2637		error = xfs_difree(tp, pag, ip->i_ino, &xic);
  2638		if (error)
  2639			goto out;
  2640	
  2641		/*
  2642		 * Free any local-format data sitting around before we reset the
  2643		 * data fork to extents format.  Note that the attr fork data has
  2644		 * already been freed by xfs_attr_inactive.
  2645		 */
  2646		if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
  2647			kmem_free_rcu(ip->i_df.if_u1.if_data);
> 2648			RCU_INIT_POINTER(ip->i_df.if_u1.if_data, NULL);
  2649			ip->i_df.if_bytes = 0;
  2650		}
  2651	
  2652		VFS_I(ip)->i_mode = 0;		/* mark incore inode as free */
  2653		ip->i_diflags = 0;
  2654		ip->i_diflags2 = mp->m_ino_geo.new_diflags2;
  2655		ip->i_forkoff = 0;		/* mark the attr fork not in use */
  2656		ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;
  2657		if (xfs_iflags_test(ip, XFS_IPRESERVE_DM_FIELDS))
  2658			xfs_iflags_clear(ip, XFS_IPRESERVE_DM_FIELDS);
  2659	
  2660		/* Don't attempt to replay owner changes for a deleted inode */
  2661		spin_lock(&iip->ili_lock);
  2662		iip->ili_fields &= ~(XFS_ILOG_AOWNER | XFS_ILOG_DOWNER);
  2663		spin_unlock(&iip->ili_lock);
  2664	
  2665		/*
  2666		 * Bump the generation count so no one will be confused
  2667		 * by reincarnations of this inode.
  2668		 */
  2669		VFS_I(ip)->i_generation++;
  2670		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
  2671	
  2672		if (xic.deleted)
  2673			error = xfs_ifree_cluster(tp, pag, ip, &xic);
  2674	out:
  2675		xfs_perag_put(pag);
  2676		return error;
  2677	}
  2678	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Dave Chinner Nov. 17, 2021, 9:06 p.m. UTC | #21
On Wed, Nov 17, 2021 at 10:19:46AM +0800, Ian Kent wrote:
> On Wed, 2021-11-17 at 11:22 +1100, Dave Chinner wrote:
> > On Tue, Nov 16, 2021 at 10:59:05AM -0500, Brian Foster wrote:
> > > On Tue, Nov 16, 2021 at 02:01:20PM +1100, Dave Chinner wrote:
> > > > On Tue, Nov 16, 2021 at 09:03:31AM +0800, Ian Kent wrote:
> > > > > On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> > > > > > If it isn't safe for ext4 to do that, then we have a general
> > > > > > pathwalk problem, not an XFS issue. But, as you say, it is
> > > > > > safe
> > > > > > to do this zeroing, so the fix to xfs_ifree() is to zero the
> > > > > > link buffer instead of freeing it, just like ext4 does.
> > > > > > 
> > > > > > As a side issue, we really don't want to move what XFS does
> > > > > > in
> > > > > > .destroy_inode to .free_inode because that then means we need
> > > > > > to
> > > > > > add synchronize_rcu() calls everywhere in XFS that might need
> > > > > > to
> > > > > > wait on inodes being inactivated and/or reclaimed. And
> > > > > > because
> > > > > > inode reclaim uses lockless rcu lookups, there's substantial
> > > > > > danger of adding rcu callback related deadlocks to XFS here.
> > > > > > That's just not a direction we should be moving in.
> > > > > 
> > > > > Another reason I decided to use the ECHILD return instead is
> > > > > that
> > > > > I thought synchronize_rcu() might add an unexpected delay.
> > > > 
> > > > It depends where you put the synchronize_rcu() call. :)
> > > > 
> > > > > Since synchronize_rcu() will only wait for processes that
> > > > > currently have the rcu read lock, do you think that could
> > > > > actually be a problem in this code path?
> > > > 
> > > > No, I don't think it will.  The inode recycle case in XFS inode
> > > > lookup can trigger in two cases:
> > > > 
> > > > 1. VFS cache eviction followed by immediate lookup
> > > > 2. Inode has been unlinked and evicted, then freed and reallocated by
> > > > the filesystem.
> > > > 
> > > > In case #1, that's a cold cache lookup and hence delays are
> > > > acceptable (e.g. a slightly longer delay might result in having to
> > > > fetch the inode from disk again). Calling synchronise_rcu() in this
> > > > case is not going to be any different from having to fetch the inode
> > > > from disk...
> > > > 
> > > > In case #2, there's a *lot* of CPU work being done to modify
> > > > metadata (inode btree updates, etc), and so the operations can block
> > > > on journal space, metadata IO, etc. Delays are acceptable, and could
> > > > be in the order of hundreds of milliseconds if the transaction
> > > > subsystem is bottlenecked. Waiting for an RCU grace period when we
> > > > reallocate an inode immediately after freeing it isn't a big deal.
> > > > 
> > > > IOWs, if synchronize_rcu() turns out to be a problem, we can
> > > > optimise that separately - we need to correct the inode reuse
> > > > behaviour w.r.t. VFS RCU expectations, then we can optimise the
> > > > result if there are perf problems stemming from correct behaviour.
> > > > 
> > > 
> > > FWIW, with a fairly crude test on a high cpu count system, it's not that
> > > difficult to reproduce an observable degradation in inode allocation
> > > rate with a synchronous grace period in the inode reuse path, caused
> > > purely by a lookup heavy workload on a completely separate filesystem.
> > > 
> > > The following is a 5m snapshot of the iget stats from a filesystem doing
> > > allocs/frees with an external/heavy lookup workload (which is not included
> > > in the stats), with and without a sync grace period wait in the reuse
> > > path:
> > > 
> > > baseline:       ig 1337026 1331541 4 5485 0 5541 1337026
> > > sync_rcu_test:  ig 2955 2588 0 367 0 383 2955
> > 
> > The alloc/free part of the workload is a single threaded
> > create/unlink in a tight loop, yes?
> > 
> > This smells like a side effect of aggressive reallocation of
> > just-freed XFS_IRECLAIMABLE inodes from the finobt that haven't had
> > their unlink state written back to disk yet. i.e. this is a corner
> > case in #2 above where a small set of inodes is being repeatedly
> > allocated and freed by userspace and hence being aggressively reused
> > and never needing to wait for IO. i.e. a tempfile workload
> > optimisation...
> > 
> > > I think this is kind of the nature of RCU and why I'm not sure it's a
> > > great idea to rely on update side synchronization in a codepath that
> > > might want to scale/perform in certain workloads.
> > 
> > The problem here is not update side synchronisation. Root cause is
> > aggressive reallocation of recently freed VFS inodes via physical
> > inode allocation algorithms. Unfortunately, the RCU grace period
> > requirements of the VFS inode life cycle dictate that we can't
> > aggressively re-allocate and reuse freed inodes like this. i.e.
> > reallocation of a just-freed inode also has to wait for an RCU grace
> > period to pass before the in memory inode can be re-instantiated as
> > a newly allocated inode.
> > 
> > (Hmmmm - I wonder if any of the other filesystems might have similar
> > problems with physical inode reallocation inside an RCU grace period?
> > i.e. without inode instance re-use, the VFS could potentially see
> > multiple in-memory instances of the same physical inode at the same
> > time.)
> > 
> > > I'm not totally sure
> > > if this will be a problem for real users running real workloads or not,
> > > or if this can be easily mitigated, whether it's all rcu or a cascading
> > > effect, etc. This is just a quick test so that all probably requires
> > > more test and analysis to discern.
> > 
> > This looks like a similar problem to what busy extents address - we
> > can't reuse a newly freed extent until the transaction containing
> > the EFI/EFD hit stable storage (and the discard operation on the
> > range is complete). Hence while newly freed extents are
> > marked free in the allocbt, they can't be reused until they are
> > released from the busy extent tree.
> > 
> > I can think of several ways to address this, but let me think on it
> > a bit more.  I suspect there's a trick we can use to avoid needing
> > synchronise_rcu() completely by using the spare radix tree tag and
> > rcu grace period state checks with get_state_synchronize_rcu() and
> > poll_state_synchronize_rcu() to clear the radix tree tags via a
> > periodic radix tree tag walk (i.e. allocation side polling for "can
> > we use this inode" rather than waiting for the grace period to
> > expire once an inode has been selected and allocated.)
> 
> The synchronise_rcu() seems like it's too broad a brush.

It has always been a big hammer. But correctness comes first, speed
second.

> It sounds like there are relatively simple ways to avoid the link
> path race which I won't go into again but there's still a chance
> inode re-use can cause confusion if done at the wrong time.
> 
> So it sounds like per-object (inode) granularity is needed for the
> wait and that means answering the question "how do we know when it's
> ok to re-use the inode" when we come to alloc the inode and want to
> re-use one.

When we free the inode, we simply mark it with a radix tree tag and
record the rcu grace sequence in the inode via
get_state_synchronize_rcu().  Then when the allocator selects an inode,
we do a radix tree tag lookup on that inode number,
and if it is set we move to the next free inode in the finobt. Every
so often we sweep the radix tree clearing the tags for inodes with
expired grace periods, allowing them to be reallocated again. The
radix tree lookups during allocation would be fairly cheap
(lockless, read-only, just checking for a tag, not dereferencing the
slot) - I added two lookups on this tree per unlinked inode to turn
the list into a doubly linked list in memory and didn't see any
significant increase in overhead. If allocation succeeds then we are
going to insert/lookup the inode in that slot in the near future, so
we are going to take the hit of bringing that radix tree node into
CPU caches anyway...
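
A minimal sketch of that scheme, assuming a spare XFS_ICI_BUSY_TAG
radix tree tag and a new ip->i_destroy_gp cookie field (both made-up
names, not existing XFS code), and ignoring the interaction with
inode reclaim removing entries from the tree:

/* At inode free: record the GP state and tag the slot as busy. */
static void xfs_inode_set_busy(struct xfs_perag *pag, struct xfs_inode *ip)
{
	ip->i_destroy_gp = get_state_synchronize_rcu();
	spin_lock(&pag->pag_ici_lock);
	radix_tree_tag_set(&pag->pag_ici_root,
			XFS_INO_TO_AGINO(pag->pag_mount, ip->i_ino),
			XFS_ICI_BUSY_TAG);
	spin_unlock(&pag->pag_ici_lock);
}

/* At allocation: lockless, read-only check - skip tagged inodes. */
static bool xfs_inode_is_busy(struct xfs_perag *pag, xfs_agino_t agino)
{
	return radix_tree_tag_get(&pag->pag_ici_root, agino,
			XFS_ICI_BUSY_TAG) != 0;
}

/* Periodic sweep: untag inodes whose grace period has expired. */
static void xfs_inode_clear_busy(struct xfs_perag *pag,
		struct xfs_inode *ip, xfs_agino_t agino)
{
	if (!poll_state_synchronize_rcu(ip->i_destroy_gp))
		return;		/* grace period still in progress */
	spin_lock(&pag->pag_ici_lock);
	radix_tree_tag_clear(&pag->pag_ici_root, agino, XFS_ICI_BUSY_TAG);
	spin_unlock(&pag->pag_ici_lock);
}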

> There'd be a need to know when not to wait at all too ... mmm.

Yup, that's what get_state_synchronize_rcu and
poll_state_synchronize_rcu give us.

Cheers,

Dave
Dave Chinner Nov. 17, 2021, 9:48 p.m. UTC | #22
On Wed, Nov 17, 2021 at 01:56:17PM -0500, Brian Foster wrote:
> On Wed, Nov 17, 2021 at 11:22:51AM +1100, Dave Chinner wrote:
> > On Tue, Nov 16, 2021 at 10:59:05AM -0500, Brian Foster wrote:
> > > On Tue, Nov 16, 2021 at 02:01:20PM +1100, Dave Chinner wrote:
> > > > On Tue, Nov 16, 2021 at 09:03:31AM +0800, Ian Kent wrote:
> > > > > On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> > > > > > If it isn't safe for ext4 to do that, then we have a general
> > > > > > pathwalk problem, not an XFS issue. But, as you say, it is safe
> > > > > > to do this zeroing, so the fix to xfs_ifree() is to zero the
> > > > > > link buffer instead of freeing it, just like ext4 does.
> > > > > > 
> > > > > > As a side issue, we really don't want to move what XFS does in
> > > > > > .destroy_inode to .free_inode because that then means we need to
> > > > > > add synchronise_rcu() calls everywhere in XFS that might need to
> > > > > > wait on inodes being inactivated and/or reclaimed. And because
> > > > > > inode reclaim uses lockless rcu lookups, there's substantial
> > > > > > danger of adding rcu callback related deadlocks to XFS here.
> > > > > > That's just not a direction we should be moving in.
> > > > > 
> > > > > Another reason I decided to use the ECHILD return instead is that
> > > > > I thought synchronise_rcu() might add an unexpected delay.
> > > > 
> > > > It depends where you put the synchronise_rcu() call. :)
> > > > 
> > > > > Since synchronise_rcu() will only wait for processes that
> > > > > currently have the rcu read lock do you think that could actually
> > > > > be a problem in this code path?
> > > > 
> > > > No, I don't think it will.  The inode recycle case in XFS inode
> > > > lookup can trigger in two cases:
> > > > 
> > > > 1. VFS cache eviction followed by immediate lookup
> > > > 2. Inode has been unlinked and evicted, then freed and reallocated by
> > > > the filesystem.
> > > > 
> > > > In case #1, that's a cold cache lookup and hence delays are
> > > > acceptable (e.g. a slightly longer delay might result in having to
> > > > fetch the inode from disk again). Calling synchronise_rcu() in this
> > > > case is not going to be any different from having to fetch the inode
> > > > from disk...
> > > > 
> > > > In case #2, there's a *lot* of CPU work being done to modify
> > > > metadata (inode btree updates, etc), and so the operations can block
> > > > on journal space, metadata IO, etc. Delays are acceptable, and could
> > > > be in the order of hundreds of milliseconds if the transaction
> > > > subsystem is bottlenecked. Waiting for an RCU grace period when we
> > > > reallocate an inode immediately after freeing it isn't a big deal.
> > > > 
> > > > IOWs, if synchronize_rcu() turns out to be a problem, we can
> > > > optimise that separately - we need to correct the inode reuse
> > > > behaviour w.r.t. VFS RCU expectations, then we can optimise the
> > > > result if there are perf problems stemming from correct behaviour.
> > > > 
> > > 
> > > FWIW, with a fairly crude test on a high cpu count system, it's not that
> > > difficult to reproduce an observable degradation in inode allocation
> > > rate with a synchronous grace period in the inode reuse path, caused
> > > purely by a lookup heavy workload on a completely separate filesystem.
> > >
> > > The following is a 5m snapshot of the iget stats from a filesystem doing
> > > > allocs/frees with an external/heavy lookup workload (which is not included
> > > in the stats), with and without a sync grace period wait in the reuse
> > > path:
> > > 
> > > baseline:	ig 1337026 1331541 4 5485 0 5541 1337026
> > > sync_rcu_test:	ig 2955 2588 0 367 0 383 2955
> > 
> > The alloc/free part of the workload is a single threaded
> > create/unlink in a tight loop, yes?
> > 
> > This smells like a side effect of aggressive reallocation of
> > just-freed XFS_IRECLAIMABLE inodes from the finobt that haven't had
> > their unlink state written back to disk yet. i.e. this is a corner
> > case in #2 above where a small set of inodes is being repeatedly
> > allocated and freed by userspace and hence being aggressively reused
> > and never needing to wait for IO. i.e. a tempfile workload
> > optimisation...
> > 
> 
> Yes, that was the point of the test.. to stress inode reuse against
> known rcu activity.
> 
> > > I think this is kind of the nature of RCU and why I'm not sure it's a
> > > great idea to rely on update side synchronization in a codepath that
> > > might want to scale/perform in certain workloads.
> > 
> > The problem here is not update side synchronisation. Root cause is
> > aggressive reallocation of recently freed VFS inodes via physical
> > inode allocation algorithms. Unfortunately, the RCU grace period
> > requirements of the VFS inode life cycle dictate that we can't
> > aggressively re-allocate and reuse freed inodes like this. i.e.
> > reallocation of a just-freed inode also has to wait for an RCU grace
> > period to pass before the in memory inode can be re-instantiated as
> > a newly allocated inode.
> > 
> 
> I'm just showing that insertion of an synchronous rcu grace period wait
> in the iget codepath is not without side effect, because that was the
> proposal.
> 
> > (Hmmmm - I wonder if any of the other filesystems might have similar
> > problems with physical inode reallocation inside an RCU grace period?
> > i.e. without inode instance re-use, the VFS could potentially see
> > multiple in-memory instances of the same physical inode at the same
> > time.)
> > 
> > > I'm not totally sure
> > > if this will be a problem for real users running real workloads or not,
> > > or if this can be easily mitigated, whether it's all rcu or a cascading
> > > effect, etc. This is just a quick test so that all probably requires
> > > more test and analysis to discern.
> > 
> > This looks like a similar problem to what busy extents address - we
> > can't reuse a newly freed extent until the transaction containing
> > the EFI/EFD hit stable storage (and the discard operation on the
> > range is complete). Hence while newly freed extents are
> > marked free in the allocbt, they can't be reused until they are
> > released from the busy extent tree.
> > 
> > I can think of several ways to address this, but let me think on it
> > a bit more.  I suspect there's a trick we can use to avoid needing
> > synchronise_rcu() completely by using the spare radix tree tag and
> > rcu grace period state checks with get_state_synchronize_rcu() and
> > poll_state_synchronize_rcu() to clear the radix tree tags via a
> > periodic radix tree tag walk (i.e. allocation side polling for "can
> > we use this inode" rather than waiting for the grace period to
> > expire once an inode has been selected and allocated.)
> > 
> 
> Yeah, and same. It's just a matter of how to break things down. I can
> sort of see where you're going with the above, though I'm not totally
> convinced that rcu gp polling is an advantage over explicit use of
> existing infrastructure/apis.

RCU gp polling is existing infrastructure/apis. It's used in several
places to explicitly elide unnecessary calls to
synchronise_rcu()....

> It seems more important that we avoid
> overly crude things like sync waits in the alloc path vs. optimize away
> potentially multiple async grace periods in the free path. Of course,
> it's worth thinking about options regardless.
> 
> That said, is deferred inactivation still a thing? If so, then we've

Already merged.

> already decided to defer/batch inactivations from the point the vfs
> calls our ->destroy_inode() based on our own hueristic (which is likely
> longer than a grace period already in most cases, making this even less
> of an issue).

No. Performance problems with large/long queues dictated a solution
in the other direction, into lockless, minimal depth, low delay
per-cpu deferred batching. IOWs, batch scheduling has significantly
faster scheduling requirements than RCU grace periods provide.

> That includes deferral of the physical free and inobt
> updates, which means inode reuse can't occur until the inactivation
> workqueue task runs.

Which can happen the moment the inode is queued for inactivation
on CONFIG_PREEMPT configs, long before a RCU grace period has
expired.

> Only a single grace period is required to cover
> (from the rcuwalk perspective) the entire set of inodes queued for
> inactivation. That leaves at least a few fairly straightforward options:
> 
> 1. Use queue_rcu_work() to schedule the inactivation task. We'd probably
> have to isolate the list to process first from the queueing context
> rather than from workqueue context to ensure we don't process recently
> added inodes that haven't sat for a grace period.

No, that takes too long. Long queues simply mean deferred
inactivation is working on cold CPU caches and that means we take a
30-50% performance hit on inode eviction overhead for inodes that
need inactivation (e.g. unlinked inodes) just by having to load all
the inode state into CPU caches again.

Numbers I recorded at the time indicate that inactivation that
doesn't block on IO or the log typically takes between 200-500us
of CPU time, so the deferred batch sizes are sized to run about
10-15ms worth of deferred processing at a time. Filling a batch
takes memory reclaim about 200uS when running
dispose_list() to evict inodes.

The problem with using RCU grace periods is that they delay the
start of the work for at least 10ms, sometimes hundreds of ms.
Using queue_rcu_work() means we will need to go back to unbound
depth queues to avoid blocking waiting for grace period expiry to
maintain performance. This means having tens of thousands of inodes
queued for inactivation before the workqueue starts running. These
are the sorts of numbers that caused all the problems Darrick was
having with performance, and that was all cold cache loading
overhead which is unavoidable with such large queue depths....

> 2. Drop a synchronize_rcu() in the workqueue task before it starts
> processing items.

Same problem. The per-cpu queue currently has a hard throttle at 256
inodes and that means nothing will be able to queue inodes for
inactivation on that CPU until the current RCU grace period expires.

> 3. Incorporate something like the above with an rcu grace period cookie
> to selectively process inodes (or batches thereof).

The use of lockless linked lists in the per-cpu queues makes that
difficult. Lockless dequeue requires removing the entire list from
the shared list head atomically, and it's a singly linked list so we
can't selectively remove inodes from the list without a full
traversal. And selective removal from the list can't be done
locklessly.
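
For reference, this is the <linux/llist.h> constraint in code form,
simplified from the merged xfs_inodegc_worker(): producers push with
llist_add(&ip->i_gclist, &gc->list) from any context, but the only
atomic dequeue is llist_del_all(), which detaches the whole chain:

static void inodegc_drain(struct xfs_inodegc *gc)
{
	/* Take *everything*; a subset cannot be popped locklessly. */
	struct llist_node *node = llist_del_all(&gc->list);
	struct xfs_inode *ip, *next;

	llist_for_each_entry_safe(ip, next, node, i_gclist)
		xfs_inodegc_inactivate(ip);
}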

We could, potentially, use a separate lockless queue for unlinked
inodes and defer that to after a grace period, but then rm -rf
workloads will go much, much slower.

Cheers,

Dave.
Brian Foster Nov. 19, 2021, 7:44 p.m. UTC | #23
On Thu, Nov 18, 2021 at 08:48:52AM +1100, Dave Chinner wrote:
> On Wed, Nov 17, 2021 at 01:56:17PM -0500, Brian Foster wrote:
> > On Wed, Nov 17, 2021 at 11:22:51AM +1100, Dave Chinner wrote:
> > > On Tue, Nov 16, 2021 at 10:59:05AM -0500, Brian Foster wrote:
> > > > On Tue, Nov 16, 2021 at 02:01:20PM +1100, Dave Chinner wrote:
> > > > > On Tue, Nov 16, 2021 at 09:03:31AM +0800, Ian Kent wrote:
> > > > > > On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> > > > > > > If it isn't safe for ext4 to do that, then we have a general
> > > > > > > pathwalk problem, not an XFS issue. But, as you say, it is safe
> > > > > > > to do this zeroing, so the fix to xfs_ifree() is to zero the
> > > > > > > link buffer instead of freeing it, just like ext4 does.
> > > > > > > 
> > > > > > > As a side issue, we really don't want to move what XFS does in
> > > > > > > .destroy_inode to .free_inode because that then means we need to
> > > > > > > add synchronise_rcu() calls everywhere in XFS that might need to
> > > > > > > wait on inodes being inactivated and/or reclaimed. And because
> > > > > > > inode reclaim uses lockless rcu lookups, there's substantial
> > > > > > > danger of adding rcu callback related deadlocks to XFS here.
> > > > > > > That's just not a direction we should be moving in.
> > > > > > 
> > > > > > Another reason I decided to use the ECHILD return instead is that
> > > > > > I thought synchronise_rcu() might add an unexpected delay.
> > > > > 
> > > > > It depends where you put the synchronise_rcu() call. :)
> > > > > 
> > > > > > Since synchronise_rcu() will only wait for processes that
> > > > > > currently have the rcu read lock do you think that could actually
> > > > > > be a problem in this code path?
> > > > > 
> > > > > No, I don't think it will.  The inode recycle case in XFS inode
> > > > > lookup can trigger in two cases:
> > > > > 
> > > > > 1. VFS cache eviction followed by immediate lookup
> > > > > 2. Inode has been unlinked and evicted, then freed and reallocated by
> > > > > the filesystem.
> > > > > 
> > > > > In case #1, that's a cold cache lookup and hence delays are
> > > > > acceptable (e.g. a slightly longer delay might result in having to
> > > > > fetch the inode from disk again). Calling synchronise_rcu() in this
> > > > > case is not going to be any different from having to fetch the inode
> > > > > from disk...
> > > > > 
> > > > > In case #2, there's a *lot* of CPU work being done to modify
> > > > > metadata (inode btree updates, etc), and so the operations can block
> > > > > on journal space, metadata IO, etc. Delays are acceptable, and could
> > > > > be in the order of hundreds of milliseconds if the transaction
> > > > > subsystem is bottlenecked. Waiting for an RCU grace period when we
> > > > > reallocate an inode immediately after freeing it isn't a big deal.
> > > > > 
> > > > > IOWs, if synchronize_rcu() turns out to be a problem, we can
> > > > > optimise that separately - we need to correct the inode reuse
> > > > > behaviour w.r.t. VFS RCU expectations, then we can optimise the
> > > > > result if there are perf problems stemming from correct behaviour.
> > > > > 
> > > > 
> > > > FWIW, with a fairly crude test on a high cpu count system, it's not that
> > > > difficult to reproduce an observable degradation in inode allocation
> > > > rate with a synchronous grace period in the inode reuse path, caused
> > > > purely by a lookup heavy workload on a completely separate filesystem.
> > > >
> > > > The following is a 5m snapshot of the iget stats from a filesystem doing
> > > > allocs/frees with an external/heavy lookup workload (which is not included
> > > > in the stats), with and without a sync grace period wait in the reuse
> > > > path:
> > > > 
> > > > baseline:	ig 1337026 1331541 4 5485 0 5541 1337026
> > > > sync_rcu_test:	ig 2955 2588 0 367 0 383 2955
> > > 
> > > The alloc/free part of the workload is a single threaded
> > > create/unlink in a tight loop, yes?
> > > 
> > > This smells like a side effect of aggressive reallocation of
> > > just-freed XFS_IRECLAIMABLE inodes from the finobt that haven't had
> > > their unlink state written back to disk yet. i.e. this is a corner
> > > case in #2 above where a small set of inodes is being repeatedly
> > > allocated and freed by userspace and hence being aggressively reused
> > > and never needing to wait for IO. i.e. a tempfile workload
> > > optimisation...
> > > 
> > 
> > Yes, that was the point of the test.. to stress inode reuse against
> > known rcu activity.
> > 
> > > > I think this is kind of the nature of RCU and why I'm not sure it's a
> > > > great idea to rely on update side synchronization in a codepath that
> > > > might want to scale/perform in certain workloads.
> > > 
> > > The problem here is not update side synchronisation. Root cause is
> > > aggressive reallocation of recently freed VFS inodes via physical
> > > inode allocation algorithms. Unfortunately, the RCU grace period
> > > requirements of the VFS inode life cycle dictate that we can't
> > > aggressively re-allocate and reuse freed inodes like this. i.e.
> > > reallocation of a just-freed inode also has to wait for an RCU grace
> > > period to pass before the in memory inode can be re-instantiated as
> > > a newly allocated inode.
> > > 
> > 
> > I'm just showing that insertion of an synchronous rcu grace period wait
> > in the iget codepath is not without side effect, because that was the
> > proposal.
> > 
> > > (Hmmmm - I wonder if any of the other filesystems might have similar
> > > problems with physical inode reallocation inside an RCU grace period?
> > > i.e. without inode instance re-use, the VFS could potentially see
> > > multiple in-memory instances of the same physical inode at the same
> > > time.)
> > > 
> > > > I'm not totally sure
> > > > if this will be a problem for real users running real workloads or not,
> > > > or if this can be easily mitigated, whether it's all rcu or a cascading
> > > > effect, etc. This is just a quick test so that all probably requires
> > > > more test and analysis to discern.
> > > 
> > > This looks like a similar problem to what busy extents address - we
> > > can't reuse a newly freed extent until the transaction containing
> > > the EFI/EFD hit stable storage (and the discard operation on the
> > > range is complete). Hence while newly freed extents are
> > > marked free in the allocbt, they can't be reused until they are
> > > released from the busy extent tree.
> > > 
> > > I can think of several ways to address this, but let me think on it
> > > a bit more.  I suspect there's a trick we can use to avoid needing
> > > synchronise_rcu() completely by using the spare radix tree tag and
> > > rcu grace period state checks with get_state_synchronize_rcu() and
> > > poll_state_synchronize_rcu() to clear the radix tree tags via a
> > > periodic radix tree tag walk (i.e. allocation side polling for "can
> > > we use this inode" rather than waiting for the grace period to
> > > expire once an inode has been selected and allocated.)
> > > 
> > 
> > Yeah, and same. It's just a matter of how to break things down. I can
> > sort of see where you're going with the above, though I'm not totally
> > convinced that rcu gp polling is an advantage over explicit use of
> > existing infrastructure/apis.
> 
> RCU gp polling is existing infrastructure/apis. It's used in several
> places to explicitly elide unnecessary calls to
> synchronise_rcu()....
> 
> > It seems more important that we avoid
> > overly crude things like sync waits in the alloc path vs. optimize away
> > potentially multiple async grace periods in the free path. Of course,
> > it's worth thinking about options regardless.
> > 
> > That said, is deferred inactivation still a thing? If so, then we've
> 
> Already merged.
> 
> > already decided to defer/batch inactivations from the point the vfs
> > calls our ->destroy_inode() based on our own heuristic (which is likely
> > longer than a grace period already in most cases, making this even less
> > of an issue).
> 
> No. Performance problems with large/long queues dictated a solution
> in the other direction, into lockless, minimal depth, low delay
> per-cpu deferred batching. IOWs, batch scheduling has significantly
> faster scheduling requirements than RCU grace periods provide.
> 
> > That includes deferral of the physical free and inobt
> > updates, which means inode reuse can't occur until the inactivation
> > workqueue task runs.
> 
> Which can happen the moment the inode is queued for inactivation
> on CONFIG_PREEMPT configs, long before a RCU grace period has
> expired.
> 
> > Only a single grace period is required to cover
> > (from the rcuwalk perspective) the entire set of inodes queued for
> > inactivation. That leaves at least a few fairly straightforward options:
> > 
> > 1. Use queue_rcu_work() to schedule the inactivation task. We'd probably
> > have to isolate the list to process first from the queueing context
> > rather than from workqueue context to ensure we don't process recently
> > added inodes that haven't sat for a grace period.
> 
> No, that takes too long. Long queues simply mean deferred
> inactivation is working on cold CPU caches and that means we take a
> 30-50% performance hit on inode eviction overhead for inodes that
> need inactivation (e.g. unlinked inodes) just by having to load all
> the inode state into CPU caches again.
> 
> Numbers I recorded at the time indicate that inactivation that
> doesn't block on IO or the log typically takes between 200-500us
> of CPU time, so the deferred batch sizes are sized to run about
> 10-15ms worth of deferred processing at a time. Filling a batch
> takes memory reclaim about 200uS when running
> dispose_list() to evict inodes.
> 
> The problem with using RCU grace periods is that they delay the
> start of the work for at least 10ms, sometimes hundreds of ms.
> Using queue_rcu_work() means we will need to go back to unbound
> depth queues to avoid blocking waiting for grace period expiry to
> maintain performance. This means having tens of thousands of inodes
> queued for inactivation before the workqueue starts running. These
> are the sorts of numbers that caused all the problems Darrick was
> having with performance, and that was all cold cache loading
> overhead which is unavoidable with such large queue depths....
> 

Hm, Ok. I recall the large queue depth issues on the earlier versions
but had not caught up with the subsequent changes that limit (percpu)
batch size, etc. The cond sync rcu approach is easy enough to hack in
(i.e., sample gp on destroy, cond_sync_rcu() on inactivate), so I ran a
few experiments on opposing ends of the spectrum wrt concurrency.

The short of it is that I do see about a 30% hit in the single threaded
sustained removal case with current batch sizes. If the workload scales
out to many (64) cpus, the impact dissipates, I suspect because we've
already distributed the workload across percpu wq tasks and thus drive
the rcu subsystem with context switches and other quiescent states that
progress grace periods. The single threaded perf hit mitigates at about
4x the current throttling threshold.
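
In code, the hack above is essentially the cookie API again (a sketch;
i_destroy_gp is a made-up field name for the recorded cookie):

	/* ->destroy_inode() side: sample the grace-period state. */
	ip->i_destroy_gp = get_state_synchronize_rcu();

	/* Inactivation side: block only if that grace period has not
	 * already elapsed - a no-op in the common case. */
	cond_synchronize_rcu(ip->i_destroy_gp);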

> > 2. Drop a synchronize_rcu() in the workqueue task before it starts
> > processing items.
> 
> Same problem. The per-cpu queue currently has a hard throttle at 256
> inodes and that means nothing will be able to queue inodes for
> inactivation on that CPU until the current RCU grace period expires.
> 

Yeah..

> > 3. Incorporate something like the above with an rcu grace period cookie
> > to selectively process inodes (or batches thereof).
> 
> The use of lockless linked lists in the per-cpu queues makes that
> difficult. Lockless dequeue requires removing the entire list from
> the shared list head atomically, and it's a singly linked list so we
> can't selectively remove inodes from the list without a full
> traversal. And selective removal from the list can't be done
> locklessly.
> 

Sure, that's why I referred to "batches thereof."

> We could, potentially, use a separate lockless queue for unlinked
> inodes and defer that to after a grace period, but then rm -rf
> workloads will go much, much slower.
> 

I don't quite follow what you mean by a separate lockless queue..?  In
any event, another experiment I ran in light of the above results that
might be similar was to put the inode queueing component of
destroy_inode() behind an rcu callback. This reduces the single threaded
perf hit from the previous approach by about 50%. So not entirely
baseline performance, but it's back closer to baseline if I double the
throttle threshold (and actually faster at 4x). Perhaps my crude
prototype logic could be optimized further to not rely on percpu
threshold changes to match the baseline.
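
A sketch of that second experiment (it borrows the VFS inode's i_rcu
head, which is otherwise only used once the inode is being freed, and
glosses over the fact that call_rcu() callbacks run in softirq
context, so the real queueing step may need punting to process
context):

static void xfs_inodegc_queue_rcu(struct rcu_head *head)
{
	struct inode *inode = container_of(head, struct inode, i_rcu);

	xfs_inodegc_queue(XFS_I(inode));	/* existing queueing logic */
}

/* ->destroy_inode(): defer the queueing itself by one grace period. */
call_rcu(&VFS_I(ip)->i_rcu, xfs_inodegc_queue_rcu);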

My overall takeaway from these couple hacky experiments is that the
unconditional synchronous rcu wait is indeed probably too heavy weight,
as you point out. The polling or callback (or perhaps your separate
queue) approach seems to be in the ballpark of viability, however,
particularly when we consider the behavior of scaled or mixed workloads
(since inactive queue processing seems to be size driven vs. latency
driven).

So I dunno.. if you consider the various severity and complexity
tradeoffs, this certainly seems worth more consideration to me. I can
think of other potentially interesting ways to experiment with
optimizing the above or perhaps tweak queueing to better facilitate
taking advantage of grace periods, but it's not worth going too far down
that road if you're wedded to the "busy inodes" approach.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Nov. 22, 2021, 12:08 a.m. UTC | #24
On Fri, Nov 19, 2021 at 02:44:21PM -0500, Brian Foster wrote:
> On Thu, Nov 18, 2021 at 08:48:52AM +1100, Dave Chinner wrote:
> > On Wed, Nov 17, 2021 at 01:56:17PM -0500, Brian Foster wrote:
> > > On Wed, Nov 17, 2021 at 11:22:51AM +1100, Dave Chinner wrote:
> > > Only a single grace period is required to cover
> > > (from the rcuwalk perspective) the entire set of inodes queued for
> > > inactivation. That leaves at least a few fairly straightforward options:
> > > 
> > > 1. Use queue_rcu_work() to schedule the inactivation task. We'd probably
> > > have to isolate the list to process first from the queueing context
> > > rather than from workqueue context to ensure we don't process recently
> > > added inodes that haven't sat for a grace period.
> > 
> > No, that takes too long. Long queues simply mean deferred
> > inactivation is working on cold CPU caches and that means we take a
> > 30-50% performance hit on inode eviction overhead for inodes that
> > need inactivation (e.g. unlinked inodes) just by having to load all
> > the inode state into CPU caches again.
> > 
> > Numbers I recorded at the time indicate that inactivation that
> > doesn't block on IO or the log typically takes between 200-500us
> > of CPU time, so the deferred batch sizes are sized to run about
> > 10-15ms worth of deferred processing at a time. Filling a batch
> > takes memory reclaim about 200uS when running
> > dispose_list() to evict inodes.
> > 
> > The problem with using RCU grace periods is that they delay the
> > start of the work for at least 10ms, sometimes hundreds of ms.
> > Using queue_rcu_work() means we will need to go back to unbound
> > depth queues to avoid blocking waiting for grace period expiry to
> > maintain performance. This means having tens of thousands of inodes
> > queued for inactivation before the workqueue starts running. These
> > are the sorts of numbers that caused all the problems Darrick was
> > having with performance, and that was all cold cache loading
> > overhead which is unavoidable with such large queue depths....
> > 
> 
> Hm, Ok. I recall the large queue depth issues on the earlier versions
> but had not caught up with the subsequent changes that limit (percpu)
> batch size, etc. The cond sync rcu approach is easy enough to hack in
> (i.e., sample gp on destroy, cond_sync_rcu() on inactivate), so I ran a
> few experiments on opposing ends of the spectrum wrt concurrency.
> 
> The short of it is that I do see about a 30% hit in the single threaded
> sustained removal case with current batch sizes. If the workload scales
> out to many (64) cpus, the impact dissipates, I suspect because we've
> already distributed the workload across percpu wq tasks and thus drive
> the rcu subsystem with context switches and other quiescent states that
> progress grace periods. The single threaded perf hit mitigates at about
> 4x the current throttling threshold.

I doubt that thread count increases are actually mitigating the perf
hit. Performance hits hard limits on concurrent rm -rf threads due
to CIL lock contention at 700,000-800,000 transactions/s
(hence the scalability patchset) regardless of the concurrency of
the workload. With that bottleneck removed, the system then hits
contention limits on VFS locks during
inode instantiation/reclaim. This typically happens at 1.1-1.2
million transactions/s during unlink.

Essentially, if you have a slower per-thread fs modification
workload, you can increase the concurrency to more threads and
CPUs but the system will eventually still hit the same throughput
limits. Hence a per-thread performance degradation will still reach
the same peak throughput levels, it will just take a few more
threads to reach that limit. IOWs, scale doesn't make the
per-thread degradation go away, it just allows more threads to run
at full (but degraded) performance before the scalabilty limit
threshold is hit.

> > We could, potentially, use a separate lockless queue for unlinked
> > inodes and defer that to after a grace period, but then rm -rf
> > workloads will go much, much slower.
> > 
> 
> I don't quite follow what you mean by a separate lockless queue..?

I was thinking separating unlinked symlinks into their own queue
that can be processed after a grace period....

> In
> any event, another experiment I ran in light of the above results that
> might be similar was to put the inode queueing component of
> destroy_inode() behind an rcu callback. This reduces the single threaded
> perf hit from the previous approach by about 50%. So not entirely
> baseline performance, but it's back closer to baseline if I double the
> throttle threshold (and actually faster at 4x). Perhaps my crude
> prototype logic could be optimized further to not rely on percpu
> threshold changes to match the baseline.
> 
> My overall takeaway from these couple hacky experiments is that the
> unconditional synchronous rcu wait is indeed probably too heavy weight,
> as you point out. The polling or callback (or perhaps your separate
> queue) approach seems to be in the ballpark of viability, however,
> particularly when we consider the behavior of scaled or mixed workloads
> (since inactive queue processing seems to be size driven vs. latency
> driven).
> 
> So I dunno.. if you consider the various severity and complexity
> tradeoffs, this certainly seems worth more consideration to me. I can
> think of other potentially interesting ways to experiment with
> optimizing the above or perhaps tweak queueing to better facilitate
> taking advantage of grace periods, but it's not worth going too far down
> that road if you're wedded to the "busy inodes" approach.

I'm not wedded to "busy inodes" but, as your experiments are
indicating, trying to handle rcu grace periods in the deferred
inactivation path isn't completely mitigating the impact of having
to wait for a grace period for recycling of inodes.

I suspect a rethink on the inode recycling mechanism is needed. The
way it is currently implemented was a brute force solution - it is
simple and effective. However, we need more nuance in the recycling
logic now.  That is, if we are recycling an inode that is clean, has
nlink >=1 and has not been unlinked, it means the VFS evicted it too
soon and we are going to re-instantiate it as the identical inode
that was evicted from cache.

So how much re-initialisation do we actually need for that inode?
Almost everything in the inode is still valid; the problems come
from inode_init_always() resetting the entire internal inode state
and XFS then having to set them up again.  The internal state is
already largely correct when we start recycling, and the identity of
the recycled inode does not change when nlink >= 1. Hence eliding
inode_init_always() would also go a long way to avoiding the need
for a RCU grace period to pass before we can make the inode visible
to the VFS again.

If we can do that, then the only inodes that need a grace period
before they can be recycled are unlinked inodes as they change
identity when being recycled. That identity change absolutely
requires a grace period to expire before the new instantiation can
be made visible.  Given the arbitrary delay that this can introduce
for an inode allocation operation, it seems much better suited to
detecting busy inodes than waiting for a global OS state change to
occur...

Cheers,

Dave.
Brian Foster Nov. 22, 2021, 7:27 p.m. UTC | #25
On Mon, Nov 22, 2021 at 11:08:51AM +1100, Dave Chinner wrote:
> On Fri, Nov 19, 2021 at 02:44:21PM -0500, Brian Foster wrote:
> > On Thu, Nov 18, 2021 at 08:48:52AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 17, 2021 at 01:56:17PM -0500, Brian Foster wrote:
> > > > On Wed, Nov 17, 2021 at 11:22:51AM +1100, Dave Chinner wrote:
> > > > Only a single grace period is required to cover
> > > > (from the rcuwalk perspective) the entire set of inodes queued for
> > > > inactivation. That leaves at least a few fairly straightforward options:
> > > > 
> > > > 1. Use queue_rcu_work() to schedule the inactivation task. We'd probably
> > > > have to isolate the list to process first from the queueing context
> > > > rather than from workqueue context to ensure we don't process recently
> > > > added inodes that haven't sat for a grace period.
> > > 
> > > No, that takes too long. Long queues simply mean deferred
> > > inactivation is working on cold CPU caches and that means we take a
> > > 30-50% performance hit on inode eviction overhead for inodes that
> > > need inactivation (e.g. unlinked inodes) just by having to load all
> > > the inode state into CPU caches again.
> > > 
> > > Numbers I recorded at the time indicate that inactivation that
> > > doesn't block on IO or the log typically takes between 200-500us
> > > of CPU time, so the deferred batch sizes are sized to run about
> > > 10-15ms worth of deferred processing at a time. Filling a batch
> > > takes memory reclaim about 200uS when running
> > > dispose_list() to evict inodes.
> > > 
> > > The problem with using RCU grace periods is that they delay the
> > > start of the work for at least 10ms, sometimes hundreds of ms.
> > > Using queue_rcu_work() means we will need to go back to unbound
> > > depth queues to avoid blocking waiting for grace period expiry to
> > > maintain perfomrance. THis means having tens of thousands of inodes
> > > queued for inactivation before the workqueue starts running. These
> > > are the sorts of numbers that caused all the problems Darrick was
> > > having with performance, and that was all cold cache loading
> > > overhead which is unavoidable with such large queue depths....
> > > 
> > 
> > Hm, Ok. I recall the large queue depth issues on the earlier versions
> > but had not caught up with the subsequent changes that limit (percpu)
> > batch size, etc. The cond sync rcu approach is easy enough to hack in
> > (i.e., sample gp on destroy, cond_sync_rcu() on inactivate), so I ran a
> > few experiments on opposing ends of the spectrum wrt concurrency.
> > 
> > The short of it is that I do see about a 30% hit in the single threaded
> > sustained removal case with current batch sizes. If the workload scales
> > out to many (64) cpus, the impact dissipates, I suspect because we've
> > already distributed the workload across percpu wq tasks and thus drive
> > the rcu subsystem with context switches and other quiescent states that
> > progress grace periods. The single threaded perf hit mitigates at about
> > 4x the current throttling threshold.
> 
> I doubt that thread count increases are actually mitigating the perf
> hit. Performance hits hard limits on concurrent rm -rf threads due
> to CIL lock contention at 700,000-800,000 transactions/s
> (hence the scalability patchset) regardless of the concurrency of
> the workload. With that bottleneck removed, the system then hits
> contention limits on VFS locks during
> inode instantiation/reclaim. This typically happens at 1.1-1.2
> million transactions/s during unlink.
> 
> Essentially, if you have a slower per-thread fs modification
> workload, you can increase the concurrency to more threads and
> CPUs but the system will eventually still hit the same throughput
> limits. Hence a per-thread performance degradation will still reach
> the same peak throughput levels, it will just take a few more
> threads to reach that limit. IOWs, scale doesn't make the
> per-thread degradation go away, it just allows more threads to run
> at full (but degraded) performance before the scalabilty limit
> threshold is hit.
> 

All I'm testing for here is how the prospective rcu tweak behaves at
scale. It introduces a noticeable hit with a single thread and no real
noticeable change at scale. FWIW, I see a similar result with a quick
test on top of the log scalability changes. I.e., I see a noticeable
improvement in concurrent rm -rf from a baseline kernel to baseline +
log scalability, and that improvement persists with the rcu hack added
on top.

> > > We could, potentially, use a separate lockless queue for unlinked
> > > inodes and defer that to after a grace period, but then rm -rf
> > > workloads will go much, much slower.
> > > 
> > 
> > I don't quite follow what you mean by a separate lockless queue..?
> 
> I was thinking separating unlinked symlinks into their own queue
> that can be processed after a grace period....
> 

Ok. I had a similar idea in mind, but rather than handle inode types
specially, just split the single queue into multiple queues that can
pipeline and potentially mitigate rcu delays. This essentially takes the
previous optimization (that mitigated the hit by about 50%) a bit
further, but I'm not totally sure it's a win without testing it.
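
Concretely, the split might look something like this (a hypothetical
sketch, not tested code; process_batch() stands in for the existing
inactivation processing):

extern void process_batch(struct llist_node *node);	/* stand-in */

#define INODEGC_NR_BATCHES	4

struct inodegc_batch {
	struct llist_head	list;
	unsigned long		gp_cookie;	/* get_state_synchronize_rcu() */
};

struct inodegc_pipeline {
	struct inodegc_batch	batch[INODEGC_NR_BATCHES];
	int			fill;		/* batch currently being filled */
};

/* Seal the current batch with a GP cookie and rotate to the next. */
static void inodegc_seal_batch(struct inodegc_pipeline *p)
{
	p->batch[p->fill].gp_cookie = get_state_synchronize_rcu();
	p->fill = (p->fill + 1) % INODEGC_NR_BATCHES;
}

/* Only drain batches whose grace period has already elapsed. */
static void inodegc_drain_expired(struct inodegc_pipeline *p)
{
	int i;

	for (i = 0; i < INODEGC_NR_BATCHES; i++) {
		struct inodegc_batch *b = &p->batch[i];

		if (i == p->fill || llist_empty(&b->list))
			continue;
		if (!poll_state_synchronize_rcu(b->gp_cookie))
			continue;	/* still inside the grace period */
		process_batch(llist_del_all(&b->list));
	}
}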

> > In
> > any event, another experiment I ran in light of the above results that
> > might be similar was to put the inode queueing component of
> > destroy_inode() behind an rcu callback. This reduces the single threaded
> > perf hit from the previous approach by about 50%. So not entirely
> > baseline performance, but it's back closer to baseline if I double the
> > throttle threshold (and actually faster at 4x). Perhaps my crude
> > prototype logic could be optimized further to not rely on percpu
> > threshold changes to match the baseline.
> > 
> > My overall takeaway from these couple hacky experiments is that the
> > unconditional synchronous rcu wait is indeed probably too heavy weight,
> > as you point out. The polling or callback (or perhaps your separate
> > queue) approach seems to be in the ballpark of viability, however,
> > particularly when we consider the behavior of scaled or mixed workloads
> > (since inactive queue processing seems to be size driven vs. latency
> > driven).
> > 
> > So I dunno.. if you consider the various severity and complexity
> > tradeoffs, this certainly seems worth more consideration to me. I can
> > think of other potentially interesting ways to experiment with
> > optimizing the above or perhaps tweak queueing to better facilitate
> > taking advantage of grace periods, but it's not worth going too far down
> > that road if you're wedded to the "busy inodes" approach.
> 
> I'm not wedded to "busy inodes" but, as your experiments are
> indicating, trying to handle rcu grace periods into the deferred
> inactivation path isn't completely mitigating the impact of having
> to wait for a grace period for recycling of inodes.
> 

What I'm seeing so far is that the impact seems to be limited to the
single threaded workload and largely mitigated by an increase in the
percpu throttle limit. IOW, it's not completely free right out of the
gate, but the impact seems isolated and potentially mitigated by
adjustment of the pipeline.

I realize the throttle is a percpu value, so that is what has me
wondering about some potential for gains in efficiency to try and get
more of that single-threaded performance back in other ways, or perhaps
enhancements that might be more broadly beneficial to deferred
inactivations in general (i.e. some form of adaptive throttling
thresholds to balance percpu thresholds against a global threshold).

> I suspect a rethink on the inode recycling mechanism is needed. The
> way it is currently implemented was a brute force solution - it is
> simple and effective. However, we need more nuance in the recycling
> logic now.  That is, if we are recycling an inode that is clean, has
> nlink >=1 and has not been unlinked, it means the VFS evicted it too
> soon and we are going to re-instantiate it as the identical inode
> that was evicted from cache.
> 

Probably. How advantageous is inode memory reuse supposed to be in the
first place? I've more considered it a necessary side effect of broader
architecture (i.e. deferred reclaim, etc.) as opposed to a primary
optimization. I still see reuse occur with deferred inactivation, we
just end up cycling through the same set of inodes as they fall through
the queue rather than recycling the same one over and over. I'm sort of
wondering what the impact would be if we didn't do this at all (for the
new allocation case at least).

> So how much re-initialisation do we actually need for that inode?
> Almost everything in the inode is still valid; the problems come
> from inode_init_always() resetting the entire internal inode state
> and XFS then having to set them up again.  The internal state is
> already largely correct when we start recycling, and the identity of
> the recycled inode does not change when nlink >= 1. Hence eliding
> inode_init_always() would also go a long way to avoiding the need
> for a RCU grace period to pass before we can make the inode visible
> to the VFS again.
> 
> If we can do that, then the only inodes that need a grace period
> before they can be recycled are unlinked inodes as they change
> identity when being recycled. That identity change absolutely
> requires a grace period to expire before the new instantiation can
> be made visible.  Given the arbitrary delay that this can introduce
> for an inode allocation operation, it seems much better suited to
> detecting busy inodes than waiting for a global OS state change to
> occur...
> 

Maybe..? The experiments I've been doing are aimed at simplicity and
reducing the scope of the changes. Part of the reason for this is tbh
I'm not totally convinced we really need to do anything more complex
than preserve the inline symlink buffer one way or another (for example,
see the rfc patch below for an alternative to the inline symlink rcuwalk
disable patch). Maybe we should consider something like this anyways.

With busy inodes, we need to alter inode allocation to some degree to
accommodate. We can have (tens of?) thousands of inodes under the grace
period at any time based on current batching behavior, so it's not
totally evident to me that we won't end up with some of the same
fundamental issues to deal with here, just needing to be accommodated in
the inode allocation algorithm rather than the teardown sequence.

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 64b9bf334806..058e3fc69ff7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2644,7 +2644,7 @@ xfs_ifree(
 	 * already been freed by xfs_attr_inactive.
 	 */
 	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
-		kmem_free(ip->i_df.if_u1.if_data);
+		kfree_rcu(ip->i_df.if_u1.if_data);
 		ip->i_df.if_u1.if_data = NULL;
 		ip->i_df.if_bytes = 0;
 	}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a607d6aca5c4..e98d7f10ba7d 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -511,27 +511,6 @@ xfs_vn_get_link(
 	return ERR_PTR(error);
 }
 
-STATIC const char *
-xfs_vn_get_link_inline(
-	struct dentry		*dentry,
-	struct inode		*inode,
-	struct delayed_call	*done)
-{
-	struct xfs_inode	*ip = XFS_I(inode);
-	char			*link;
-
-	ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
-
-	/*
-	 * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED if
-	 * if_data is junk.
-	 */
-	link = ip->i_df.if_u1.if_data;
-	if (XFS_IS_CORRUPT(ip->i_mount, !link))
-		return ERR_PTR(-EFSCORRUPTED);
-	return link;
-}
-
 static uint32_t
 xfs_stat_blksize(
 	struct xfs_inode	*ip)
@@ -1250,14 +1229,6 @@ static const struct inode_operations xfs_symlink_inode_operations = {
 	.update_time		= xfs_vn_update_time,
 };
 
-static const struct inode_operations xfs_inline_symlink_inode_operations = {
-	.get_link		= xfs_vn_get_link_inline,
-	.getattr		= xfs_vn_getattr,
-	.setattr		= xfs_vn_setattr,
-	.listxattr		= xfs_vn_listxattr,
-	.update_time		= xfs_vn_update_time,
-};
-
 /* Figure out if this file actually supports DAX. */
 static bool
 xfs_inode_supports_dax(
@@ -1409,9 +1380,8 @@ xfs_setup_iops(
 		break;
 	case S_IFLNK:
 		if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL)
-			inode->i_op = &xfs_inline_symlink_inode_operations;
-		else
-			inode->i_op = &xfs_symlink_inode_operations;
+			inode->i_link = ip->i_df.if_u1.if_data;
+		inode->i_op = &xfs_symlink_inode_operations;
 		break;
 	default:
 		inode->i_op = &xfs_inode_operations;
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index fc2c6a404647..20ec2f450c56 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -497,6 +497,7 @@ xfs_inactive_symlink(
 	 * do here in that case.
 	 */
 	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
+		WRITE_ONCE(VFS_I(ip)->i_link, NULL);
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		return 0;
 	}
Dave Chinner Nov. 22, 2021, 11:26 p.m. UTC | #26
On Mon, Nov 22, 2021 at 02:27:59PM -0500, Brian Foster wrote:
> On Mon, Nov 22, 2021 at 11:08:51AM +1100, Dave Chinner wrote:
> > On Fri, Nov 19, 2021 at 02:44:21PM -0500, Brian Foster wrote:
> > > In
> > > any event, another experiment I ran in light of the above results that
> > > might be similar was to put the inode queueing component of
> > > destroy_inode() behind an rcu callback. This reduces the single threaded
> > > perf hit from the previous approach by about 50%. So not entirely
> > > baseline performance, but it's back closer to baseline if I double the
> > > throttle threshold (and actually faster at 4x). Perhaps my crude
> > > prototype logic could be optimized further to not rely on percpu
> > > threshold changes to match the baseline.
> > > 
> > > My overall takeaway from these couple hacky experiments is that the
> > > unconditional synchronous rcu wait is indeed probably too heavy weight,
> > > as you point out. The polling or callback (or perhaps your separate
> > > queue) approach seems to be in the ballpark of viability, however,
> > > particularly when we consider the behavior of scaled or mixed workloads
> > > (since inactive queue processing seems to be size driven vs. latency
> > > driven).
> > > 
> > > So I dunno.. if you consider the various severity and complexity
> > > tradeoffs, this certainly seems worth more consideration to me. I can
> > > think of other potentially interesting ways to experiment with
> > > optimizing the above or perhaps tweak queueing to better facilitate
> > > taking advantage of grace periods, but it's not worth going too far down
> > > that road if you're wedded to the "busy inodes" approach.
> > 
> > I'm not wedded to "busy inodes" but, as your experiments are
> > indicating, trying to handle rcu grace periods in the deferred
> > inactivation path isn't completely mitigating the impact of having
> > to wait for a grace period for recycling of inodes.
> > 
> 
> What I'm seeing so far is that the impact seems to be limited to the
> single threaded workload and largely mitigated by an increase in the
> percpu throttle limit. IOW, it's not completely free right out of the
> gate, but the impact seems isolated and potentially mitigated by
> adjustment of the pipeline.
> 
> I realize the throttle is a percpu value, so that is what has me
> wondering about some potential for gains in efficiency to try and get
> more of that single-threaded performance back in other ways, or perhaps
> enhancements that might be more broadly beneficial to deferred
> inactivations in general (i.e. some form of adaptive throttling
> thresholds to balance percpu thresholds against a global threshold).

I ran experiments on queue depth early on. Once we go over a few
tens of inodes we start to lose the "hot cache" effect and
performance starts to go backwards. By queue depths of hundreds,
we've lost all the hot cache and nothing else gets that performance
back because we can't avoid the latency of all the memory writes
from cache eviction and the followup memory loads that result.

Making the per-cpu queues longer or shorter based on global state
won't gain us anything. All it will do is slow down local operations
that don't otherwise need slowing down....

> > I suspect a rethink on the inode recycling mechanism is needed. The
> > way it is currently implemented was a brute force solution - it is
> > simple and effective. However, we need more nuance in the recycling
> > logic now.  That is, if we are recycling an inode that is clean, has
> > nlink >=1 and has not been unlinked, it means the VFS evicted it too
> > soon and we are going to re-instantiate it as the identical inode
> > that was evicted from cache.
> > 
> 
> Probably. How advantageous is inode memory reuse supposed to be in the
> first place? I've more considered it a necessary side effect of broader
> architecture (i.e. deferred reclaim, etc.) as opposed to a primary
> optimization.

Yes, it's an architectural feature resulting from the filesystem
inode life cycle being different to the VFS inode life cycle. This
was inherited from Irix - it had separate inactivation vs reclaim
states and action steps for vnodes - inactivation occurred when the
vnode refcount went to zero, reclaim occurred when the vnode was to
be freed.

Architecturally, Linux doesn't have this two-step infrastructure; it
just has evict() that runs everything when the inode needs to be
reclaimed. Hence we hide the two-phase reclaim architecture of XFS
behind that, and so we always had this troublesome impedance
mismatch on Linux.

Thinking a bit about this, maybe there is a more integrated way to
handle this life cycle impedance mismatch by making the way we
interact with the Linux inode cache more ... Irix-like. Linux
does actually give us a callback when the last reference to an
inode goes away: .drop_inode()

i.e. Maybe we should look to triggering inactivation work from
->drop_inode instead of ->destroy_inode and hence always leaving
unreferenced, reclaimable inodes in the VFS cache on the LRU. i.e.
rather than hiding the two-phase XFS inode inactivation+reclaim
algorithm from the VFS, move it up into the VFS.  If we prevent
inodes from being reclaimed from the LRU until they have finished
inactivation and are clean (easy enough just by marking the inode as
dirty), that would allow us to immediately reclaim and free inodes
in evict() context. Integration with __wait_on_freeing_inode() would
likely solve the RCU reuse/recycling issues.

There's more to it than just this, but perhaps the longer term
direction should be closer integration with the Linux inode cache
life cycle rather than trying to solve all these problems underneath
the VFS cache whilst still trying to play by its rules...
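
For illustration, a minimal sketch of that direction (not a tested
implementation: xfs_inode_needs_inactive() and xfs_inodegc_queue()
are currently static helpers in xfs_icache.c, borrowed here for
flavour, and the keep-cached policy is the hypothetical part):

	static int
	xfs_fs_drop_inode(
		struct inode		*inode)
	{
		struct xfs_inode	*ip = XFS_I(inode);

		if (xfs_inode_needs_inactive(ip)) {
			/*
			 * Keep the unreferenced inode on the VFS LRU
			 * until inactivation has run and it is clean;
			 * marking it dirty stops the LRU reclaiming
			 * it early.
			 */
			mark_inode_dirty(inode);
			xfs_inodegc_queue(ip);
			return 0;	/* keep the inode cached */
		}
		return generic_drop_inode(inode);
	}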

> I still see reuse occur with deferred inactivation, we
> just end up cycling through the same set of inodes as they fall through
> the queue rather than recycling the same one over and over. I'm sort of
> wondering what the impact would be if we didn't do this at all (for the
> new allocation case at least).

We end up with a larger pool of free inodes in the finobt. This is
basically what my "busy inode check" proposal is based on - inodes
that we can't allocate without recycling just remain on the finobt
for longer before they can be used. This would be awful if we didn't
have the finobt to efficiently locate free inodes - the finobt
record iteration makes it pretty low overhead to scan inodes here.
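
For concreteness, the "busy" test during such a scan could be as
cheap as an RCU cookie comparison. get_state_synchronize_rcu() and
poll_state_synchronize_rcu() are existing kernel interfaces; the
i_free_gp field (an unsigned long cookie) and the helper below are
hypothetical:

	/* at inode free time, record the current RCU state */
	ip->i_free_gp = get_state_synchronize_rcu();

	/* at allocation time: is the candidate still "busy"? */
	static bool
	xfs_inode_is_busy(
		struct xfs_inode	*ip)
	{
		/* true until the grace period current at free time ends */
		return !poll_state_synchronize_rcu(ip->i_free_gp);
	}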

> > So how much re-initialisation do we actually need for that inode?
> > Almost everything in the inode is still valid; the problems come
> > from inode_init_always() resetting the entire internal inode state
> > and XFS then having to set them up again.  The internal state is
> > already largely correct when we start recycling, and the identity of
> > the recycled inode does not change when nlink >= 1. Hence eliding
> > inode_init_always() would also go a long way to avoiding the need
> > for a RCU grace period to pass before we can make the inode visible
> > to the VFS again.
> > 
> > If we can do that, then the only inodes that need a grace period
> > before they can be recycled are unlinked inodes as they change
> > identity when being recycled. That identity change absolutely
> > requires a grace period to expire before the new instantiation can
> > be made visible.  Given the arbitrary delay that this can introduce
> > for an inode allocation operation, it seems much better suited to
> > detecting busy inodes than waiting for a global OS state change to
> > occur...
> > 
> 
> Maybe..? The experiments I've been doing are aimed at simplicity and
> reducing the scope of the changes. Part of the reason for this is tbh
> I'm not totally convinced we really need to do anything more complex
> than preserve the inline symlink buffer one way or another (for example,
> see the rfc patch below for an alternative to the inline symlink rcuwalk
> disable patch). Maybe we should consider something like this anyways.
> 
> With busy inodes, we need to alter inode allocation to some degree to
> accommodate. We can have (tens of?) thousands of inodes under the grace
> period at any time based on current batching behavior, so it's not
> totally evident to me that we won't end up with some of the same
> fundamental issues to deal with here, just needing to be accommodated in
> the inode allocation algorithm rather than the teardown sequence.

Sure, but the purpose of the allocation selection
policy is to select the best inode to allocate for the current
context.  The cost of not being able to use an inode immediately
needs to be factored into that allocation policy. i.e. if the
selected inode has an associated delay with it before it can be
reused and other free inodes don't, then we should not be selecting
the inode with a delay associated with it.

This is exactly the reasoning and logic we use for busy extents.  We
only take the blocking penalty for resolving busy extent state if we
run out of free extents to search before we've found an allocation
candidate. I think it makes sense for inode allocation, too.

> --- 8< ---
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 64b9bf334806..058e3fc69ff7 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2644,7 +2644,7 @@ xfs_ifree(
>  	 * already been freed by xfs_attr_inactive.
>  	 */
>  	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> -		kmem_free(ip->i_df.if_u1.if_data);
> +		kfree_rcu(ip->i_df.if_u1.if_data);
>  		ip->i_df.if_u1.if_data = NULL;

That would need to be rcu_assign_pointer(ip->i_df.if_u1.if_data,
NULL) to put the correct memory barriers in place, right? Also, I
think ip->i_df.if_u1.if_data needs to be set to NULL before calling
kfree_rcu() so racing lookups will always see NULL before
the object is freed.
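
I.e. something like the following ordering (sketch only, and modulo
the __rcu annotation sparse would want for rcu_assign_pointer()):

	char	*old = ip->i_df.if_u1.if_data;

	rcu_assign_pointer(ip->i_df.if_u1.if_data, NULL);
	kfree_rcu(old);		/* freed only after a grace period */
	ip->i_df.if_bytes = 0;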

But again, as I asked up front, why do we even need to free this
memory buffer here? It will be freed in xfs_inode_free_callback()
after the current RCU grace period expires, so what do we gain by
freeing it separately here?

>  		ip->i_df.if_bytes = 0;
>  	}
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a607d6aca5c4..e98d7f10ba7d 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -511,27 +511,6 @@ xfs_vn_get_link(
>  	return ERR_PTR(error);
>  }
>  
> -STATIC const char *
> -xfs_vn_get_link_inline(
> -	struct dentry		*dentry,
> -	struct inode		*inode,
> -	struct delayed_call	*done)
> -{
> -	struct xfs_inode	*ip = XFS_I(inode);
> -	char			*link;
> -
> -	ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
> -
> -	/*
> -	 * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED if
> -	 * if_data is junk.
> -	 */
> -	link = ip->i_df.if_u1.if_data;
> -	if (XFS_IS_CORRUPT(ip->i_mount, !link))
> -		return ERR_PTR(-EFSCORRUPTED);
> -	return link;
> -}
> -
>  static uint32_t
>  xfs_stat_blksize(
>  	struct xfs_inode	*ip)
> @@ -1250,14 +1229,6 @@ static const struct inode_operations xfs_symlink_inode_operations = {
>  	.update_time		= xfs_vn_update_time,
>  };
>  
> -static const struct inode_operations xfs_inline_symlink_inode_operations = {
> -	.get_link		= xfs_vn_get_link_inline,
> -	.getattr		= xfs_vn_getattr,
> -	.setattr		= xfs_vn_setattr,
> -	.listxattr		= xfs_vn_listxattr,
> -	.update_time		= xfs_vn_update_time,
> -};
> -
>  /* Figure out if this file actually supports DAX. */
>  static bool
>  xfs_inode_supports_dax(
> @@ -1409,9 +1380,8 @@ xfs_setup_iops(
>  		break;
>  	case S_IFLNK:
>  		if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> -			inode->i_op = &xfs_inline_symlink_inode_operations;
> -		else
> -			inode->i_op = &xfs_symlink_inode_operations;
> +			inode->i_link = ip->i_df.if_u1.if_data;
> +		inode->i_op = &xfs_symlink_inode_operations;

This still needs corruption checks - ip->i_df.if_u1.if_data can be
null if there's some kind of inode corruption detected.

>  		break;
>  	default:
>  		inode->i_op = &xfs_inode_operations;
> diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> index fc2c6a404647..20ec2f450c56 100644
> --- a/fs/xfs/xfs_symlink.c
> +++ b/fs/xfs/xfs_symlink.c
> @@ -497,6 +497,7 @@ xfs_inactive_symlink(
>  	 * do here in that case.
>  	 */
>  	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> +		WRITE_ONCE(VFS_I(ip)->i_link, NULL);

Again, rcu_assign_pointer(), yes?

>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  		return 0;
>  	}
> 
>
Brian Foster Nov. 24, 2021, 8:56 p.m. UTC | #27
On Tue, Nov 23, 2021 at 10:26:57AM +1100, Dave Chinner wrote:
> On Mon, Nov 22, 2021 at 02:27:59PM -0500, Brian Foster wrote:
> > On Mon, Nov 22, 2021 at 11:08:51AM +1100, Dave Chinner wrote:
> > > On Fri, Nov 19, 2021 at 02:44:21PM -0500, Brian Foster wrote:
> > > > In
> > > > any event, another experiment I ran in light of the above results that
> > > > might be similar was to put the inode queueing component of
> > > > destroy_inode() behind an rcu callback. This reduces the single threaded
> > > > perf hit from the previous approach by about 50%. So not entirely
> > > > baseline performance, but it's back closer to baseline if I double the
> > > > throttle threshold (and actually faster at 4x). Perhaps my crude
> > > > prototype logic could be optimized further to not rely on percpu
> > > > threshold changes to match the baseline.
> > > > 
> > > > My overall takeaway from these couple hacky experiments is that the
> > > > unconditional synchronous rcu wait is indeed probably too heavy weight,
> > > > as you point out. The polling or callback (or perhaps your separate
> > > > queue) approach seems to be in the ballpark of viability, however,
> > > > particularly when we consider the behavior of scaled or mixed workloads
> > > > (since inactive queue processing seems to be size driven vs. latency
> > > > driven).
> > > > 
> > > > So I dunno.. if you consider the various severity and complexity
> > > > tradeoffs, this certainly seems worth more consideration to me. I can
> > > > think of other potentially interesting ways to experiment with
> > > > optimizing the above or perhaps tweak queueing to better facilitate
> > > > taking advantage of grace periods, but it's not worth going too far down
> > > > that road if you're wedded to the "busy inodes" approach.
> > > 
> > > I'm not wedded to "busy inodes" but, as your experiments are
> > > indicating, trying to handle rcu grace periods into the deferred
> > > inactivation path isn't completely mitigating the impact of having
> > > to wait for a grace period for recycling of inodes.
> > > 
> > 
> > What I'm seeing so far is that the impact seems to be limited to the
> > single threaded workload and largely mitigated by an increase in the
> > percpu throttle limit. IOW, it's not completely free right out of the
> > gate, but the impact seems isolated and potentially mitigated by
> > adjustment of the pipeline.
> > 
> > I realize the throttle is a percpu value, so that is what has me
> > wondering about some potential for gains in efficiency to try and get
> > more of that single-threaded performance back in other ways, or perhaps
> > enhancements that might be more broadly beneficial to deferred
> > inactivations in general (i.e. some form of adaptive throttling
> > thresholds to balance percpu thresholds against a global threshold).
> 
> I ran experiments on queue depth early on. Once we go over a few
> tens of inodes we start to lose the "hot cache" effect and
> performance starts to go backwards. By queue depths of hundreds,
> we've lost all the hot cache and nothing else gets that performance
> back because we can't avoid the latency of all the memory writes
> from cache eviction and the followup memory loads that result.
> 

Admittedly my testing is simple/crude as I'm just exploring the
potential viability of a concept, not fine tuning a workload, etc. That
said, I'm curious to know what your tests for this look like because I
suspect I'm running into different conditions. My tests frequently hit
the percpu throttle threshold (256 inodes), which is beyond your ideal
tens of inodes range (and probably more throttle limited than cpu cache
limited).

> Making the per-cpu queues longer or shorter based on global state
> won't gain us anything. All it will do is slow down local operations
> that don't otherwise need slowing down....
> 

This leaves out context. The increase in throttle threshold mitigates
the delays I've introduced via the rcu callback. That happens to produce
observable results comparable to my baseline test, but it's more of a
measure of the impact of the delay than a direct proposal. If there's a
more fine grained test worth running here (re: above), please describe
it.

> > > I suspect a rethink on the inode recycling mechanism is needed. The
> > > way it is currently implemented was a brute force solution - it is
> > > simple and effective. However, we need more nuance in the recycling
> > > logic now.  That is, if we are recycling an inode that is clean, has
> > > nlink >=1 and has not been unlinked, it means the VFS evicted it too
> > > soon and we are going to re-instantiate it as the identical inode
> > > that was evicted from cache.
> > > 
> > 
> > Probably. How advantageous is inode memory reuse supposed to be in the
> > first place? I've more considered it a necessary side effect of broader
> > architecture (i.e. deferred reclaim, etc.) as opposed to a primary
> > optimization.
> 
> Yes, it's an architectural feature resulting from the filesystem
> inode life cycle being different to the VFS inode life cycle. This
> was inherited from Irix - it had separate inactivation vs reclaim
> states and action steps for vnodes - inactivation occurred when the
> vnode refcount went to zero, reclaim occurred when the vnode was to
> be freed.
> 
> Architecturally, Linux doesn't have this two-step infrastructure; it
> just has evict() that runs everything when the inode needs to be
> reclaimed. Hence we hide the two-phase reclaim architecture of XFS
> behind that, and so we always had this troublesome impedance
> mismatch on Linux.
> 

Ok, that was generally how I viewed it.

> Thinking a bit about this, maybe there is a more integrated way to
> handle this life cycle impedance mismatch by making the way we
> interact with the Linux inode cache to be more... Irix-like. Linux
> does actually give us a callback when the last reference to an
> inode goes away: .drop_inode()
> 
> i.e. Maybe we should look to triggering inactivation work from
> ->drop_inode instead of ->destroy_inode and hence always leaving
> unreferenced, reclaimable inodes in the VFS cache on the LRU. i.e.
> rather than hiding the two-phase XFS inode inactivation+reclaim
> algorithm from the VFS, move it up into the VFS.  If we prevent
> inodes from being reclaimed from the LRU until they have finished
> inactivation and are clean (easy enough just by marking the inode as
> dirty), that would allow us to immediately reclaim and free inodes
> in evict() context. Integration with __wait_on_freeing_inode() would
> likely solve the RCU reuse/recycling issues.
> 

Hmm.. this is the point where we decide whether the inode remains
cached, which is currently basically whether the inode has a link count
or not. That makes me curious what (can) happens with an
unlinked/inactivated inode on the lru. I'm not sure any other fs' do
anything like that currently..?

> There's more to it than just this, but perhaps the longer term
> direction should be closer integration with the Linux inode cache
> life cycle rather than trying to solve all these problems underneath
> the VFS cache whilst still trying to play by its rules...
> 

Yeah. Caching logic details aside, I think that makes sense.

> > I still see reuse occur with deferred inactivation, we
> > just end up cycling through the same set of inodes as they fall through
> > the queue rather than recycling the same one over and over. I'm sort of
> > wondering what the impact would be if we didn't do this at all (for the
> > new allocation case at least).
> 
> We end up with a larger pool of free inodes in the finobt. This is
> basically what my "busy inode check" proposal is based on - inodes
> that we can't allocate without recycling just remain on the finobt
> for longer before they can be used. This would be awful if we didn't
> have the finobt to efficiently locate free inodes - the finobt
> record iteration makes it pretty low overhead to scan inodes here.
> 

I get the idea. That last bit is what I'm skeptical about. The finobt is
based on the premise that free inode lookup becomes a predictable tree
lookup instead of the old searching algorithm on the inobt, which we
still support and can be awful in its own right under worst case
conditions. I agree that this would be bad on the inobt (which raises
the question on how we'd provide these recycling correctness guarantees
on !finobt fs'). What I'm more concerned about is whether this could
make finobt enabled fs' (transiently) just as poor as the old algo under
certain workloads/conditions.

I think there needs to be at least some high level description of the
search algorithm before we can sufficiently reason about its behavior..

> > > So how much re-initialisation do we actually need for that inode?
> > > Almost everything in the inode is still valid; the problems come
> > > from inode_init_always() resetting the entire internal inode state
> > > and XFS then having to set them up again.  The internal state is
> > > already largely correct when we start recycling, and the identity of
> > > the recycled inode does not change when nlink >= 1. Hence eliding
> > > inode_init_always() would also go a long way to avoiding the need
> > > for a RCU grace period to pass before we can make the inode visible
> > > to the VFS again.
> > > 
> > > If we can do that, then the only inodes that need a grace period
> > > before they can be recycled are unlinked inodes as they change
> > > identity when being recycled. That identity change absolutely
> > > requires a grace period to expire before the new instantiation can
> > > be made visible.  Given the arbitrary delay that this can introduce
> > > for an inode allocation operation, it seems much better suited to
> > > detecting busy inodes than waiting for a global OS state change to
> > > occur...
> > > 
> > 
> > Maybe..? The experiments I've been doing are aimed at simplicity and
> > reducing the scope of the changes. Part of the reason for this is tbh
> > I'm not totally convinced we really need to do anything more complex
> > than preserve the inline symlink buffer one way or another (for example,
> > see the rfc patch below for an alternative to the inline symlink rcuwalk
> > disable patch). Maybe we should consider something like this anyways.
> > 
> > With busy inodes, we need to alter inode allocation to some degree to
> > accommodate. We can have (tens of?) thousands of inodes under the grace
> > period at any time based on current batching behavior, so it's not
> > totally evident to me that we won't end up with some of the same
> > fundamental issues to deal with here, just needing to be accommodated in
> > the inode allocation algorithm rather than the teardown sequence.
> 
> Sure, but the purpose of the allocation selection
> policy is to select the best inode to allocate for the current
> context.  The cost of not being able to use an inode immediately
> needs to be factored into that allocation policy. i.e. if the
> selected inode has an associated delay with it before it can be
> reused and other free inodes don't, then we should not be selecting
> the inode with a delay associated with it.
> 

We still have to find those "no delay" inodes. AFAICT the worst case
conditions on the system I've been playing with can have something like
20k free && busy inodes. That could cover all or most of the finobt at
the time of an inode allocation. What happens from there depends on the
search algorithm.

> This is exactly the reasoning and logic we use for busy extents.  We
> only take the blocking penalty for resolving busy extent state if we
> run out of free extents to search before we've found an allocation
> candidate. I think it makes sense for inode allocation, too.
> 

Sure, the idea makes sense and it's worth looking into. But there are
enough contextual differences that I wouldn't just assume the same logic
translates over to the finobt without potential for performance impact.
For example, extent allocation has some advantages with things like
delalloc (physical block allocation occurs async from buffered write
syscall time) and the fact that metadata allocs can reuse busy blocks.
The finobt only tracks existing chunks with free inodes, so it's easily
possible to have conditions where the finobt is 100% (or majority)
populated with busy inodes (whether it be one inode or several
thousand).

This raises questions like at what point does search cost become a
factor? At what point and with what frequency do we suffer the blocking
penalty? Do we opt to allocate new chunks based on gp state? Something
else? We don't need to answer these questions here (this thread is long
enough :P). I'm just trying to say that it's one thing to consider the
approach a viable option, but it isn't automatically preferable just
because we use it for extents. Further details beyond "detect busy
inodes" would be nice to objectively reason about.

> > --- 8< ---
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 64b9bf334806..058e3fc69ff7 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2644,7 +2644,7 @@ xfs_ifree(
> >  	 * already been freed by xfs_attr_inactive.
> >  	 */
> >  	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > -		kmem_free(ip->i_df.if_u1.if_data);
> > +		kfree_rcu(ip->i_df.if_u1.if_data);
> >  		ip->i_df.if_u1.if_data = NULL;
> 
> That would need to be rcu_assign_pointer(ip->i_df.if_u1.if_data,
> NULL) to put the correct memory barriers in place, right? Also, I
> think ip->i_df.if_u1.if_data needs to be set to NULL before calling
> kfree_rcu() so racing lookups will always see NULL before
> the object is freed.
> 

I think rcu_assign_pointer() is intended to be paired with the
associated rcu deref and for scenarios like making sure an object isn't
made available until it's completely initialized (i.e. such as for rcu
protected list traversals, etc.).

With regard to ordering, we no longer access if_data in rcuwalk mode
with this change. Thus I think all we need here is the
WRITE_ONCE(i_link, NULL) that pairs with the READ_ONCE() in the vfs, and
that happens earlier in xfs_inactive_symlink() before we rcu free the
memory here. With that, ISTM a racing lookup should either see an rcu
protected i_link or NULL, the latter of which calls into ->get_link()
and triggers refwalk mode. Hm?
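
Loosely, the pairing looks like this (a sketch of the two sides, not
verbatim kernel code):

	/* teardown side, xfs_inactive_symlink(): */
	WRITE_ONCE(VFS_I(ip)->i_link, NULL);
	/* ... if_data is then kfree_rcu()'d later in xfs_ifree() ... */

	/* rcu-walk lookup side (fs/namei.c): */
	link = READ_ONCE(inode->i_link);
	if (!link)
		/* NULL dentry in rcu-walk => our ->get_link() gives -ECHILD */
		link = inode->i_op->get_link(NULL, inode, done);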

> But again, as I asked up front, why do we even need to free this
> memory buffer here? It will be freed in xfs_inode_free_callback()
> after the current RCU grace period expires, so what do we gain by
> freeing it separately here?
> 

One prevented memory leak? ;)

It won't be freed in xfs_inode_free_callback() because we change the
data fork format type (and clear i_mode) in this path. Perhaps that
could use an audit, but that's a separate issue.
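
(For reference, the fork teardown in xfs_inode_free_callback() keys
off i_mode, roughly:

	switch (VFS_I(ip)->i_mode & S_IFMT) {
	case S_IFREG:
	case S_IFDIR:
	case S_IFLNK:
		xfs_idestroy_fork(&ip->i_df);
		break;
	}

and xfs_ifree() has cleared i_mode by then, so a leftover local-format
symlink buffer never hits the S_IFLNK case.)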

> >  		ip->i_df.if_bytes = 0;
> >  	}
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index a607d6aca5c4..e98d7f10ba7d 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -511,27 +511,6 @@ xfs_vn_get_link(
> >  	return ERR_PTR(error);
> >  }
> >  
> > -STATIC const char *
> > -xfs_vn_get_link_inline(
> > -	struct dentry		*dentry,
> > -	struct inode		*inode,
> > -	struct delayed_call	*done)
> > -{
> > -	struct xfs_inode	*ip = XFS_I(inode);
> > -	char			*link;
> > -
> > -	ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
> > -
> > -	/*
> > -	 * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED if
> > -	 * if_data is junk.
> > -	 */
> > -	link = ip->i_df.if_u1.if_data;
> > -	if (XFS_IS_CORRUPT(ip->i_mount, !link))
> > -		return ERR_PTR(-EFSCORRUPTED);
> > -	return link;
> > -}
> > -
> >  static uint32_t
> >  xfs_stat_blksize(
> >  	struct xfs_inode	*ip)
> > @@ -1250,14 +1229,6 @@ static const struct inode_operations xfs_symlink_inode_operations = {
> >  	.update_time		= xfs_vn_update_time,
> >  };
> >  
> > -static const struct inode_operations xfs_inline_symlink_inode_operations = {
> > -	.get_link		= xfs_vn_get_link_inline,
> > -	.getattr		= xfs_vn_getattr,
> > -	.setattr		= xfs_vn_setattr,
> > -	.listxattr		= xfs_vn_listxattr,
> > -	.update_time		= xfs_vn_update_time,
> > -};
> > -
> >  /* Figure out if this file actually supports DAX. */
> >  static bool
> >  xfs_inode_supports_dax(
> > @@ -1409,9 +1380,8 @@ xfs_setup_iops(
> >  		break;
> >  	case S_IFLNK:
> >  		if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> > -			inode->i_op = &xfs_inline_symlink_inode_operations;
> > -		else
> > -			inode->i_op = &xfs_symlink_inode_operations;
> > +			inode->i_link = ip->i_df.if_u1.if_data;
> > +		inode->i_op = &xfs_symlink_inode_operations;
> 
> This still needs corruption checks - ip->i_df.if_u1.if_data can be
> null if there's some kind of inode corruption detected.
> 

It's fine for i_link to be NULL. We'd just fall into the get_link() call
and have to handle it there like the current callback does.

However, this does need to restore some of the code removed from
xfs_vn_get_link() in commit 30ee052e12b9 ("xfs: optimize inline
symlinks") to handle the local format case. If if_data can be NULL we'll
obviously need to handle it there anyways.
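
Roughly, that would mean re-adding a local-format branch along these
lines (sketched from memory of the pre-30ee052e12b9 xfs_readlink(),
so treat the details as approximate):

	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
		memcpy(link, ip->i_df.if_u1.if_data, pathlen + 1);
		error = 0;
	} else {
		error = xfs_readlink_bmap_ilocked(ip, link);
	}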

If there's no fundamental objection I'll address these issues, give it
some proper testing and send a real patch..

Brian

> >  		break;
> >  	default:
> >  		inode->i_op = &xfs_inode_operations;
> > diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> > index fc2c6a404647..20ec2f450c56 100644
> > --- a/fs/xfs/xfs_symlink.c
> > +++ b/fs/xfs/xfs_symlink.c
> > @@ -497,6 +497,7 @@ xfs_inactive_symlink(
> >  	 * do here in that case.
> >  	 */
> >  	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > +		WRITE_ONCE(VFS_I(ip)->i_link, NULL);
> 
> Again, rcu_assign_pointer(), yes?
> 
> >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  		return 0;
> >  	}
> > 
> > 
> 
> -- 
> Dave Chinner
> david@fromorbit.com
>
Ian Kent Dec. 15, 2021, 3:54 a.m. UTC | #28
On Wed, 2021-11-24 at 15:56 -0500, Brian Foster wrote:
> On Tue, Nov 23, 2021 at 10:26:57AM +1100, Dave Chinner wrote:
> > On Mon, Nov 22, 2021 at 02:27:59PM -0500, Brian Foster wrote:
> > > On Mon, Nov 22, 2021 at 11:08:51AM +1100, Dave Chinner wrote:
> > > > On Fri, Nov 19, 2021 at 02:44:21PM -0500, Brian Foster wrote:
> > > > > In
> > > > > any event, another experiment I ran in light of the above
> > > > > results that
> > > > > might be similar was to put the inode queueing component of
> > > > > destroy_inode() behind an rcu callback. This reduces the
> > > > > single threaded
> > > > > perf hit from the previous approach by about 50%. So not
> > > > > entirely
> > > > > baseline performance, but it's back closer to baseline if I
> > > > > double the
> > > > > throttle threshold (and actually faster at 4x). Perhaps my
> > > > > crude
> > > > > prototype logic could be optimized further to not rely on
> > > > > percpu
> > > > > threshold changes to match the baseline.
> > > > > 
> > > > > My overall takeaway from these couple hacky experiments is
> > > > > that the
> > > > > unconditional synchronous rcu wait is indeed probably too
> > > > > heavy weight,
> > > > > as you point out. The polling or callback (or perhaps your
> > > > > separate
> > > > > queue) approach seems to be in the ballpark of viability,
> > > > > however,
> > > > > particularly when we consider the behavior of scaled or mixed
> > > > > workloads
> > > > > (since inactive queue processing seems to be size driven vs.
> > > > > latency
> > > > > driven).
> > > > > 
> > > > > So I dunno.. if you consider the various severity and
> > > > > complexity
> > > > > tradeoffs, this certainly seems worth more consideration to
> > > > > me. I can
> > > > > think of other potentially interesting ways to experiment
> > > > > with
> > > > > optimizing the above or perhaps tweak queueing to better
> > > > > facilitate
> > > > > taking advantage of grace periods, but it's not worth going
> > > > > too far down
> > > > > that road if you're wedded to the "busy inodes" approach.
> > > > 
> > > > I'm not wedded to "busy inodes" but, as your experiments are
> > > > indicating, trying to handle rcu grace periods into the
> > > > deferred
> > > > inactivation path isn't completely mitigating the impact of
> > > > having
> > > > to wait for a grace period for recycling of inodes.
> > > > 
> > > 
> > > What I'm seeing so far is that the impact seems to be limited to
> > > the
> > > single threaded workload and largely mitigated by an increase in
> > > the
> > > percpu throttle limit. IOW, it's not completely free right out of
> > > the
> > > gate, but the impact seems isolated and potentially mitigated by
> > > adjustment of the pipeline.
> > > 
> > > I realize the throttle is a percpu value, so that is what has me
> > > wondering about some potential for gains in efficiency to try and
> > > get
> > > more of that single-threaded performance back in other ways, or
> > > perhaps
> > > enhancements that might be more broadly beneficial to deferred
> > > inactivations in general (i.e. some form of adaptive throttling
> > > thresholds to balance percpu thresholds against a global
> > > threshold).
> > 
> > I ran experiments on queue depth early on. Once we go over a few
> > tens of inodes we start to lose the "hot cache" effect and
> > performance starts to go backwards. By queue depths of hundreds,
> > we've lost all the hot cache and nothing else gets that performance
> > back because we can't avoid the latency of all the memory writes
> > from cache eviction and the followup memory loads that result.
> > 
> 
> Admittedly my testing is simple/crude as I'm just exploring the
> potential viability of a concept, not fine tuning a workload, etc.
> That
> said, I'm curious to know what your tests for this look like because
> I
> suspect I'm running into different conditions. My tests frequently
> hit
> the percpu throttle threshold (256 inodes), which is beyond your
> ideal
> tens of inodes range (and probably more throttle limited than cpu
> cache
> limited).
> 
> > Making the per-cpu queues longer or shorter based on global state
> > won't gain us anything. All it will do is slow down local
> > operations
> > that don't otherwise need slowing down....
> > 
> 
> This leaves out context. The increase in throttle threshold mitigates
> the delays I've introduced via the rcu callback. That happens to
> produce
> observable results comparable to my baseline test, but it's more of a
> measure of the impact of the delay than a direct proposal. If there's
> a
> more fine grained test worth running here (re: above), please
> describe
> it.
> 
> > > > I suspect a rethink on the inode recycling mechanism is needed.
> > > > The
> > > > way it is currently implemented was a brute force solution - it
> > > > is
> > > > simple and effective. However, we need more nuance in the
> > > > recycling
> > > > logic now.  That is, if we are recycling an inode that is
> > > > clean, has
> > > > nlink >=1 and has not been unlinked, it means the VFS evicted
> > > > it too
> > > > soon and we are going to re-instantiate it as the identical
> > > > inode
> > > > that was evicted from cache.
> > > > 
> > > 
> > > Probably. How advantageous is inode memory reuse supposed to be
> > > in the
> > > first place? I've more considered it a necessary side effect of
> > > broader
> > > architecture (i.e. deferred reclaim, etc.) as opposed to a
> > > primary
> > > optimization.
> > 
> > Yes, it's an architectural feature resulting from the filesystem
> > inode life cycle being different to the VFS inode life cycle. This
> > was inherited from Irix - it had separate inactivation vs reclaim
> > states and action steps for vnodes - inactivation occurred when the
> > vnode refcount went to zero, reclaim occurred when the vnode was to
> > be freed.
> > 
> > Architecturally, Linux doesn't have this two-step infrastructure;
> > it
> > just has evict() that runs everything when the inode needs to be
> > reclaimed. Hence we hide the two-phase reclaim architecture of XFS
> > behind that, and so we always had this troublesome impedance
> > mismatch on Linux.
> > 
> 
> Ok, that was generally how I viewed it.
> 
> > Thinking a bit about this, maybe there is a more integrated way to
> > handle this life cycle impedance mismatch by making the way we
> > interact with the Linux inode cache to be more... Irix-like.
> > Linux
> > does actually give us a callback when the last reference to an
> > inode goes away: .drop_inode()
> > 
> > i.e. Maybe we should look to triggering inactivation work from
> > ->drop_inode instead of ->destroy_inode and hence always leaving
> > unreferenced, reclaimable inodes in the VFS cache on the LRU. i.e.
> > rather than hiding the two-phase XFS inode inactivation+reclaim
> > algorithm from the VFS, move it up into the VFS.  If we prevent
> > inodes from being reclaimed from the LRU until they have finished
> > inactivation and are clean (easy enough just by marking the inode
> > as
> > dirty), that would allow us to immediately reclaim and free inodes
> > in evict() context. Integration with __wait_on_freeing_inode()
> > would
> > likely solve the RCU reuse/recycling issues.
> > 
> 
> Hmm.. this is the point where we decide whether the inode remains
> cached, which is currently basically whether the inode has a link
> count
> or not. That makes me curious what (can) happens with an
> unlinked/inactivated inode on the lru. I'm not sure any other fs' do
> anything like that currently..?
> 
> > There's more to it than just this, but perhaps the longer term
> > direction should be closer integration with the Linux inode cache
> > life cycle rather than trying to solve all these problems
> > underneath
> > the VFS cache whilst still trying to play by its rules...
> > 
> 
> Yeah. Caching logic details aside, I think that makes sense.
> 
> > > I still see reuse occur with deferred inactivation, we
> > > just end up cycling through the same set of inodes as they fall
> > > through
> > > the queue rather than recycling the same one over and over. I'm
> > > sort of
> > > wondering what the impact would be if we didn't do this at all
> > > (for the
> > > new allocation case at least).
> > 
> > We end up with a larger pool of free inodes in the finobt. This is
> > basically what my "busy inode check" proposal is based on - inodes
> > that we can't allocate without recycling just remain on the finobt
> > for longer before they can be used. This would be awful if we
> > didn't
> > have the finobt to efficiently locate free inodes - the finobt
> > record iteration makes it pretty low overhead to scan inodes here.
> > 
> 
> I get the idea. That last bit is what I'm skeptical about. The finobt
> is
> based on the premise that free inode lookup becomes a predictable
> tree
> lookup instead of the old searching algorithm on the inobt, which we
> still support and can be awful in its own right under worst case
> conditions. I agree that this would be bad on the inobt (which raises
> the question on how we'd provide these recycling correctness
> guarantees
> on !finobt fs'). What I'm more concerned about is whether this could
> make finobt enabled fs' (transiently) just as poor as the old algo
> under
> certain workloads/conditions.
> 
> I think there needs to be at least some high level description of the
> search algorithm before we can sufficiently reason about its
> behavior..
> 
> > > > So how much re-initialisation do we actually need for that
> > > > inode?
> > > > Almost everything in the inode is still valid; the problems
> > > > come
> > > > from inode_init_always() resetting the entire internal inode
> > > > state
> > > > and XFS then having to set them up again.  The internal state
> > > > is
> > > > already largely correct when we start recycling, and the
> > > > identity of
> > > > the recycled inode does not change when nlink >= 1. Hence
> > > > eliding
> > > > inode_init_always() would also go a long way to avoiding the
> > > > need
> > > > for a RCU grace period to pass before we can make the inode
> > > > visible
> > > > to the VFS again.
> > > > 
> > > > If we can do that, then the only inodes that need a grace
> > > > period
> > > > before they can be recycled are unlinked inodes as they change
> > > > identity when being recycled. That identity change absolutely
> > > > requires a grace period to expire before the new instantiation
> > > > can
> > > > be made visible.  Given the arbitrary delay that this can
> > > > introduce
> > > > for an inode allocation operation, it seems much better suited
> > > > to
> > > > detecting busy inodes than waiting for a global OS state change
> > > > to
> > > > occur...
> > > > 
> > > 
> > > Maybe..? The experiments I've been doing are aimed at simplicity
> > > and
> > > reducing the scope of the changes. Part of the reason for this is
> > > tbh
> > > I'm not totally convinced we really need to do anything more
> > > complex
> > > than preserve the inline symlink buffer one way or another (for
> > > example,
> > > see the rfc patch below for an alternative to the inline symlink
> > > rcuwalk
> > > disable patch). Maybe we should consider something like this
> > > anyways.
> > > 
> > > With busy inodes, we need to alter inode allocation to some
> > > degree to
> > > accommodate. We can have (tens of?) thousands of inodes under the
> > > grace
> > > period at any time based on current batching behavior, so it's
> > > not
> > > totally evident to me that we won't end up with some of the same
> > > fundamental issues to deal with here, just needing to be
> > > accommodated in
> > > the inode allocation algorithm rather than the teardown sequence.
> > 
> > Sure, but the purpose of the allocation selection
> > policy is to select the best inode to allocate for the current
> > context.  The cost of not being able to use an inode immediately
> > needs to be factored into that allocation policy. i.e. if the
> > selected inode has an associated delay with it before it can be
> > reused and other free inodes don't, then we should not be selecting
> > the inode with a delay associated with it.
> > 
> 
> We still have to find those "no delay" inodes. AFAICT the worst case
> conditions on the system I've been playing with can have something
> like
> 20k free && busy inodes. That could cover all or most of the finobt
> at
> the time of an inode allocation. What happens from there depends on
> the
> search algorithm.
> 
> > This is exactly the reasoning and logic we use for busy extents. 
> > We
> > only take the blocking penalty for resolving busy extent state if
> > we
> > run out of free extents to search before we've found an allocation
> > candidate. I think it makes sense for inode allocation, too.
> > 
> 
> Sure, the idea makes sense and it's worth looking into. But there are
> enough contextual differences that I wouldn't just assume the same
> logic
> translates over to the finobt without potential for performance
> impact.
> For example, extent allocation has some advantages with things like
> delalloc (physical block allocation occurs async from buffered write
> syscall time) and the fact that metadata allocs can reuse busy
> blocks.
> The finobt only tracks existing chunks with free inodes, so it's
> easily
> possible to have conditions where the finobt is 100% (or majority)
> populated with busy inodes (whether it be one inode or several
> thousand).
> 
> This raises questions like at what point does search cost become a
> factor? At what point and with what frequency do we suffer the
> blocking
> penalty? Do we opt to allocate new chunks based on gp state?
> Something
> else? We don't need to answer these questions here (this thread is
> long
> enough :P). I'm just trying to say that it's one thing to consider
> the
> approach a viable option, but it isn't automatically preferable just
> because we use it for extents. Further details beyond "detect busy
> inodes" would be nice to objectively reason about.
> 
> > > --- 8< ---
> > > 
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 64b9bf334806..058e3fc69ff7 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -2644,7 +2644,7 @@ xfs_ifree(
> > >          * already been freed by xfs_attr_inactive.
> > >          */
> > >         if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > -               kmem_free(ip->i_df.if_u1.if_data);
> > > +               kfree_rcu(ip->i_df.if_u1.if_data);
> > >                 ip->i_df.if_u1.if_data = NULL;
> > 
> > That would need to be rcu_assign_pointer(ip->i_df.if_u1.if_data,
> > NULL) to put the correct memory barriers in place, right? Also, I
> > think ip->i_df.if_u1.if_data needs to be set to NULL before calling
> > kfree_rcu() so racing lookups will always see NULL before
> > the object is freed.
> > 
> 
> I think rcu_assign_pointer() is intended to be paired with the
> associated rcu deref and for scenarios like making sure an object
> isn't
> made available until it's completely initialized (i.e. such as for
> rcu
> protected list traversals, etc.).
> 
> With regard to ordering, we no longer access if_data in rcuwalk mode
> with this change. Thus I think all we need here is the
> WRITE_ONCE(i_link, NULL) that pairs with the READ_ONCE() in the vfs,
> and
> that happens earlier in xfs_inactive_symlink() before we rcu free the
> memory here. With that, ISTM a racing lookup should either see an rcu
> protected i_link or NULL, the latter of which calls into ->get_link()
> and triggers refwalk mode. Hm?
> 
> > But again, as I asked up front, why do we even need to free this
> > memory buffer here? It will be freed in xfs_inode_free_callback()
> > after the current RCU grace period expires, so what do we gain by
> > freeing it separately here?
> > 

The thing that's been bugging me is not knowing whether the VFS has
finished using the link string.

The link itself can be removed while the link path is still valid,
and the VFS will then walk that path. So relying on rcu grace time
alone might not be sufficient, but stashing the pointer and
rcu-freeing it ensures the pointer won't go away while the walk is
under way, with no grace period guessing. Tying the free to the
current path walk via an rcu-delayed free is a reliable way to do
what's needed.

The only problem here is keeping that path string alive while it is
being used. If there aren't any other known problems with the
xfs inode re-use subsystem I don't see any worth in complicating
it to cater for this special case.

Brian's patch is a variation on the original patch and is all
that's really needed. IMHO going this way (whatever we end up
with) is the sensible thing to do.

Ian
> 
> One prevented memory leak? ;)
> 
> It won't be freed in xfs_inode_free_callback() because we change the
> data fork format type (and clear i_mode) in this path. Perhaps that
> could use an audit, but that's a separate issue.
> 
> > >                 ip->i_df.if_bytes = 0;
> > >         }
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index a607d6aca5c4..e98d7f10ba7d 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -511,27 +511,6 @@ xfs_vn_get_link(
> > >         return ERR_PTR(error);
> > >  }
> > >  
> > > -STATIC const char *
> > > -xfs_vn_get_link_inline(
> > > -       struct dentry           *dentry,
> > > -       struct inode            *inode,
> > > -       struct delayed_call     *done)
> > > -{
> > > -       struct xfs_inode        *ip = XFS_I(inode);
> > > -       char                    *link;
> > > -
> > > -       ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
> > > -
> > > -       /*
> > > -        * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED if
> > > -        * if_data is junk.
> > > -        */
> > > -       link = ip->i_df.if_u1.if_data;
> > > -       if (XFS_IS_CORRUPT(ip->i_mount, !link))
> > > -               return ERR_PTR(-EFSCORRUPTED);
> > > -       return link;
> > > -}
> > > -
> > >  static uint32_t
> > >  xfs_stat_blksize(
> > >         struct xfs_inode        *ip)
> > > @@ -1250,14 +1229,6 @@ static const struct inode_operations xfs_symlink_inode_operations = {
> > >         .update_time            = xfs_vn_update_time,
> > >  };
> > >  
> > > -static const struct inode_operations xfs_inline_symlink_inode_operations = {
> > > -       .get_link               = xfs_vn_get_link_inline,
> > > -       .getattr                = xfs_vn_getattr,
> > > -       .setattr                = xfs_vn_setattr,
> > > -       .listxattr              = xfs_vn_listxattr,
> > > -       .update_time            = xfs_vn_update_time,
> > > -};
> > > -
> > >  /* Figure out if this file actually supports DAX. */
> > >  static bool
> > >  xfs_inode_supports_dax(
> > > @@ -1409,9 +1380,8 @@ xfs_setup_iops(
> > >                 break;
> > >         case S_IFLNK:
> > >                 if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> > > -                       inode->i_op = &xfs_inline_symlink_inode_operations;
> > > -               else
> > > -                       inode->i_op = &xfs_symlink_inode_operations;
> > > +                       inode->i_link = ip->i_df.if_u1.if_data;
> > > +               inode->i_op = &xfs_symlink_inode_operations;
> > 
> > This still needs corruption checks - ip->i_df.if_u1.if_data can be
> > null if there's some kind of inode corruption detected.
> > 
> 
> It's fine for i_link to be NULL. We'd just fall into the get_link()
> call
> and have to handle it there like the current callback does.
> 
> However, this does need to restore some of the code removed from
> xfs_vn_get_link() in commit 30ee052e12b9 ("xfs: optimize inline
> symlinks") to handle the local format case. If if_data can be NULL
> we'll
> obviously need to handle it there anyways.
> 
> If there's no fundamental objection I'll address these issues, give
> it
> some proper testing and send a real patch..
> 
> Brian
> 
> > >                 break;
> > >         default:
> > >                 inode->i_op = &xfs_inode_operations;
> > > diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> > > index fc2c6a404647..20ec2f450c56 100644
> > > --- a/fs/xfs/xfs_symlink.c
> > > +++ b/fs/xfs/xfs_symlink.c
> > > @@ -497,6 +497,7 @@ xfs_inactive_symlink(
> > >          * do here in that case.
> > >          */
> > >         if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > +               WRITE_ONCE(VFS_I(ip)->i_link, NULL);
> > 
> > Again, rcu_assign_pointer(), yes?
> > 
> > >                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > >                 return 0;
> > >         }
> > > 
> > > 
> > 
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
>
Darrick J. Wong Dec. 15, 2021, 5:06 a.m. UTC | #29
On Wed, Dec 15, 2021 at 11:54:11AM +0800, Ian Kent wrote:
> On Wed, 2021-11-24 at 15:56 -0500, Brian Foster wrote:
> > On Tue, Nov 23, 2021 at 10:26:57AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 22, 2021 at 02:27:59PM -0500, Brian Foster wrote:
> > > > On Mon, Nov 22, 2021 at 11:08:51AM +1100, Dave Chinner wrote:
> > > > > On Fri, Nov 19, 2021 at 02:44:21PM -0500, Brian Foster wrote:
> > > > > > In
> > > > > > any event, another experiment I ran in light of the above
> > > > > > results that
> > > > > > might be similar was to put the inode queueing component of
> > > > > > destroy_inode() behind an rcu callback. This reduces the
> > > > > > single threaded
> > > > > > perf hit from the previous approach by about 50%. So not
> > > > > > entirely
> > > > > > baseline performance, but it's back closer to baseline if I
> > > > > > double the
> > > > > > throttle threshold (and actually faster at 4x). Perhaps my
> > > > > > crude
> > > > > > prototype logic could be optimized further to not rely on
> > > > > > percpu
> > > > > > threshold changes to match the baseline.
> > > > > > 
> > > > > > My overall takeaway from these couple hacky experiments is
> > > > > > that the
> > > > > > unconditional synchronous rcu wait is indeed probably too
> > > > > > heavy weight,
> > > > > > as you point out. The polling or callback (or perhaps your
> > > > > > separate
> > > > > > queue) approach seems to be in the ballpark of viability,
> > > > > > however,
> > > > > > particularly when we consider the behavior of scaled or mixed
> > > > > > workloads
> > > > > > (since inactive queue processing seems to be size driven vs.
> > > > > > latency
> > > > > > driven).
> > > > > > 
> > > > > > So I dunno.. if you consider the various severity and
> > > > > > complexity
> > > > > > tradeoffs, this certainly seems worth more consideration to
> > > > > > me. I can
> > > > > > think of other potentially interesting ways to experiment
> > > > > > with
> > > > > > optimizing the above or perhaps tweak queueing to better
> > > > > > facilitate
> > > > > > taking advantage of grace periods, but it's not worth going
> > > > > > too far down
> > > > > > that road if you're wedded to the "busy inodes" approach.
> > > > > 
> > > > > I'm not wedded to "busy inodes" but, as your experiments are
> > > > > indicating, trying to handle rcu grace periods into the
> > > > > deferred
> > > > > inactivation path isn't completely mitigating the impact of
> > > > > having
> > > > > to wait for a grace period for recycling of inodes.
> > > > > 
> > > > 
> > > > What I'm seeing so far is that the impact seems to be limited to
> > > > the
> > > > single threaded workload and largely mitigated by an increase in
> > > > the
> > > > percpu throttle limit. IOW, it's not completely free right out of
> > > > the
> > > > gate, but the impact seems isolated and potentially mitigated by
> > > > adjustment of the pipeline.
> > > > 
> > > > I realize the throttle is a percpu value, so that is what has me
> > > > wondering about some potential for gains in efficiency to try and
> > > > get
> > > > more of that single-threaded performance back in other ways, or
> > > > perhaps
> > > > enhancements that might be more broadly beneficial to deferred
> > > > inactivations in general (i.e. some form of adaptive throttling
> > > > thresholds to balance percpu thresholds against a global
> > > > threshold).
> > > 
> > > I ran experiments on queue depth early on. Once we go over a few
> > > tens of inodes we start to lose the "hot cache" effect and
> > > performance starts to go backwards. By queue depths of hundreds,
> > > we've lost all the hot cache and nothing else gets that performance
> > > back because we can't avoid the latency of all the memory writes
> > > from cache eviction and the followup memory loads that result.
> > > 
> > 
> > Admittedly my testing is simple/crude as I'm just exploring the
> > potential viability of a concept, not fine tuning a workload, etc.
> > That
> > said, I'm curious to know what your tests for this look like because
> > I
> > suspect I'm running into different conditions. My tests frequently
> > hit
> > the percpu throttle threshold (256 inodes), which is beyond your
> > ideal
> > tens of inodes range (and probably more throttle limited than cpu
> > cache
> > limited).
> > 
> > > Making the per-cpu queues longer or shorter based on global state
> > > won't gain us anything. All it will do is slow down local
> > > operations
> > > that don't otherwise need slowing down....
> > > 
> > 
> > This leaves out context. The increase in throttle threshold mitigates
> > the delays I've introduced via the rcu callback. That happens to
> > produce
> > observable results comparable to my baseline test, but it's more of a
> > measure of the impact of the delay than a direct proposal. If there's
> > a
> > more fine grained test worth running here (re: above), please
> > describe
> > it.
> > 
> > > > > I suspect a rethink on the inode recycling mechanism is needed.
> > > > > The
> > > > > way it is currently implemented was a brute force solution - it
> > > > > is
> > > > > simple and effective. However, we need more nuance in the
> > > > > recycling
> > > > > logic now.  That is, if we are recycling an inode that is
> > > > > clean, has
> > > > > nlink >=1 and has not been unlinked, it means the VFS evicted
> > > > > it too
> > > > > soon and we are going to re-instantiate it as the identical
> > > > > inode
> > > > > that was evicted from cache.
> > > > > 
> > > > 
> > > > Probably. How advantageous is inode memory reuse supposed to be
> > > > in the
> > > > first place? I've more considered it a necessary side effect of
> > > > broader
> > > > architecture (i.e. deferred reclaim, etc.) as opposed to a
> > > > primary
> > > > optimization.
> > > 
> > > Yes, it's an architectural feature resulting from the filesystem
> > > inode life cycle being different to the VFS inode life cycle. This
> > > was inherited from Irix - it had separate inactivation vs reclaim
> > > states and action steps for vnodes - inactivation occurred when the
> > > vnode refcount went to zero, reclaim occurred when the vnode was to
> > > be freed.
> > > 
> > > Architecturally, Linux doesn't have this two-step infrastructure;
> > > it
> > > just has evict() that runs everything when the inode needs to be
> > > reclaimed. Hence we hide the two-phase reclaim architecture of XFS
> > > behind that, and so we always had this troublesome impedance
> > > mismatch on Linux.
> > > 
> > 
> > Ok, that was generally how I viewed it.
> > 
> > > Thinking a bit about this, maybe there is a more integrated way to
> > > handle this life cycle impedance mismatch by making the way we
> > > interact with the Linux inode cache to be more... Irix-like.
> > > Linux
> > > does actually give us a callback when the last reference to an
> > > inode goes away: .drop_inode()
> > > 
> > > i.e. Maybe we should look to triggering inactivation work from
> > > ->drop_inode instead of ->destroy_inode and hence always leaving
> > > unreferenced, reclaimable inodes in the VFS cache on the LRU. i.e.
> > > rather than hiding the two-phase XFS inode inactivation+reclaim
> > > algorithm from the VFS, move it up into the VFS.  If we prevent
> > > inodes from being reclaimed from the LRU until they have finished
> > > inactivation and are clean (easy enough just by marking the inode
> > > as
> > > dirty), that would allow us to immediately reclaim and free inodes
> > > in evict() context. Integration with __wait_on_freeing_inode()
> > > would
> > > likely solve the RCU reuse/recycling issues.
> > > 
> > 
> > Hmm.. this is the point where we decide whether the inode remains
> > cached, which is currently basically whether the inode has a link
> > count
> > or not. That makes me curious what (can) happens with an
> > unlinked/inactivated inode on the lru. I'm not sure any other fs' do
> > anything like that currently..?
> > 
> > > There's more to it than just this, but perhaps the longer term
> > > direction should be closer integration with the Linux inode cache
> > > life cycle rather than trying to solve all these problems
> > > underneath
> > > the VFS cache whilst still trying to play by its rules...
> > > 
> > 
> > Yeah. Caching logic details aside, I think that makes sense.
> > 
> > > > I still see reuse occur with deferred inactivation, we
> > > > just end up cycling through the same set of inodes as they fall
> > > > through
> > > > the queue rather than recycling the same one over and over. I'm
> > > > sort of
> > > > wondering what the impact would be if we didn't do this at all
> > > > (for the
> > > > new allocation case at least).
> > > 
> > > We end up with a larger pool of free inodes in the finobt. This is
> > > basically what my "busy inode check" proposal is based on - inodes
> > > that we can't allocate without recycling just remain on the finobt
> > > for longer before they can be used. This would be awful if we
> > > didn't
> > > have the finobt to efficiently locate free inodes - the finobt
> > > record iteration makes it pretty low overhead to scan inodes here.
> > > 
> > 
> > I get the idea. That last bit is what I'm skeptical about. The finobt
> > is
> > based on the premise that free inode lookup becomes a predictable
> > tree
> > lookup instead of the old searching algorithm on the inobt, which we
> > still support and can be awful in its own right under worst case
> > conditions. I agree that this would be bad on the inobt (which raises
> > the question on how we'd provide these recycling correctness
> > guarantees
> > on !finobt fs'). What I'm more concerned about is whether this could
> > make finobt enabled fs' (transiently) just as poor as the old algo
> > under
> > certain workloads/conditions.
> > 
> > I think there needs to be at least some high level description of the
> > search algorithm before we can sufficiently reason about its
> > behavior..
> > 
> > > > > So how much re-initialisation do we actually need for that
> > > > > inode?
> > > > > Almost everything in the inode is still valid; the problems
> > > > > come
> > > > > from inode_init_always() resetting the entire internal inode
> > > > > state
> > > > > and XFS then having to set them up again.  The internal state
> > > > > is
> > > > > already largely correct when we start recycling, and the
> > > > > identity of
> > > > > the recycled inode does not change when nlink >= 1. Hence
> > > > > eliding
> > > > > inode_init_always() would also go a long way to avoiding the
> > > > > need
> > > > > for a RCU grace period to pass before we can make the inode
> > > > > visible
> > > > > to the VFS again.
> > > > > 
> > > > > If we can do that, then the only inodes that need a grace
> > > > > period
> > > > > before they can be recycled are unlinked inodes as they change
> > > > > identity when being recycled. That identity change absolutely
> > > > > requires a grace period to expire before the new instantiation
> > > > > can
> > > > > be made visible.  Given the arbitrary delay that this can
> > > > > introduce
> > > > > for an inode allocation operation, it seems much better suited
> > > > > to
> > > > > detecting busy inodes than waiting for a global OS state change
> > > > > to
> > > > > occur...
> > > > > 
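
Condensed into code, the decision being described might look like this
(xfs_reinit_inode_light() is hypothetical, and today recycling always
goes through the full inode_init_always()):

static int
xfs_iget_recycle_decide(
        struct xfs_inode        *ip)
{
        /*
         * Identity unchanged: evicted with nlink >= 1 and being
         * re-instantiated as the same file, so most VFS state is
         * still valid and no grace period is required.
         */
        if (VFS_I(ip)->i_nlink >= 1 && !(ip->i_flags & XFS_NEED_INACTIVE))
                return xfs_reinit_inode_light(ip);

        /*
         * Identity change (unlinked inode being reallocated): the new
         * instantiation must not become visible to the VFS until an
         * RCU grace period has expired - treat it as busy for now.
         */
        return -EAGAIN;
}
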
> > > > 
> > > > Maybe..? The experiments I've been doing are aimed at simplicity
> > > > and
> > > > reducing the scope of the changes. Part of the reason for this is
> > > > tbh
> > > > I'm not totally convinced we really need to do anything more
> > > > complex
> > > > than preserve the inline symlink buffer one way or another (for
> > > > example,
> > > > see the rfc patch below for an alternative to the inline symlink
> > > > rcuwalk
> > > > disable patch). Maybe we should consider something like this
> > > > anyways.
> > > > 
> > > > With busy inodes, we need to alter inode allocation to some
> > > > degree to
> > > > accommodate. We can have (tens of?) thousands of inodes under the
> > > > grace
> > > > period at any time based on current batching behavior, so it's
> > > > not
> > > > totally evident to me that we won't end up with some of the same
> > > > fundamental issues to deal with here, just needing to be
> > > > accommodated in
> > > > the inode allocation algorithm rather than the teardown sequence.
> > > 
> > > Sure, but the purpose of the allocation selection
> > > policy is to select the best inode to allocate for the current
> > > context.  The cost of not being able to use an inode immediately
> > > needs to be factored into that allocation policy. i.e. if the
> > > selected inode has an associated delay with it before it can be
> > > reused and other free inodes don't, then we should not be selecting
> > > the inode with a delay associated with it.
> > > 
> > 
> > We still have to find those "no delay" inodes. AFAICT the worst case
> > conditions on the system I've been playing with can have something
> > like
> > 20k free && busy inodes. That could cover all or most of the finobt
> > at
> > the time of an inode allocation. What happens from there depends on
> > the
> > search algorithm.
> > 
> > > This is exactly the reasoning and logic we use for busy extents. 
> > > We
> > > only take the blocking penalty for resolving busy extent state if
> > > we
> > > run out of free extents to search before we've found an allocation
> > > candidate. I think it makes sense for inode allocation, too.
> > > 
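
To make the analogy concrete, the allocation side might be shaped like
this (every helper here is hypothetical except synchronize_rcu(); the
policy questions raised below all live inside the scan function):

static int
xfs_dialloc_nonbusy(
        struct xfs_perag        *pag,
        xfs_ino_t               *inop)
{
        int                     error;

retry:
        /* prefer any free inode whose RCU grace period has expired */
        error = xfs_finobt_scan_nonbusy(pag, inop);
        if (error != -EAGAIN)
                return error;

        /* otherwise try to allocate a fresh inode chunk... */
        if (xfs_can_alloc_inode_chunk(pag))
                return xfs_ialloc_new_chunk(pag, inop);

        /* ...and only pay the blocking penalty as a last resort */
        synchronize_rcu();
        goto retry;
}
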
> > 
> > Sure, the idea makes sense and it's worth looking into. But there are
> > enough contextual differences that I wouldn't just assume the same
> > logic
> > translates over to the finobt without potential for performance
> > impact.
> > For example, extent allocation has some advantages with things like
> > delalloc (physical block allocation occurs async from buffered write
> > syscall time) and the fact that metadata allocs can reuse busy
> > blocks.
> > The finobt only tracks existing chunks with free inodes, so it's
> > easily
> > possible to have conditions where the finobt is 100% (or majority)
> > populated with busy inodes (whether it be one inode or several
> > thousand).
> > 
> > This raises questions like at what point does search cost become a
> > factor? At what point and with what frequency do we suffer the
> > blocking
> > penalty? Do we opt to allocate new chunks based on gp state?
> > Something
> > else? We don't need to answer these questions here (this thread is
> > long
> > enough :P). I'm just trying to say that it's one thing to consider
> > the
> > approach a viable option, but it isn't automatically preferable just
> > because we use it for extents. Further details beyond "detect busy
> > inodes" would be nice to objectively reason about.
> > 
> > > > --- 8< ---
> > > > 
> > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > index 64b9bf334806..058e3fc69ff7 100644
> > > > --- a/fs/xfs/xfs_inode.c
> > > > +++ b/fs/xfs/xfs_inode.c
> > > > @@ -2644,7 +2644,7 @@ xfs_ifree(
> > > >          * already been freed by xfs_attr_inactive.
> > > >          */
> > > >         if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > > -               kmem_free(ip->i_df.if_u1.if_data);
> > > > +               kfree_rcu(ip->i_df.if_u1.if_data);
> > > >                 ip->i_df.if_u1.if_data = NULL;
> > > 
> > > That would need to be rcu_assign_pointer(ip->i_df.if_u1.if_data,
> > > NULL) to put the correct memory barriers in place, right? Also, I
> > > think ip->i_df.if_u1.if_data needs to be set to NULL before calling
> > > kfree_rcu() so racing lookups will always see NULL before
> > > the object is freed.
> > > 
> > 
> > I think rcu_assign_pointer() is intended to be paired with the
> > associated rcu deref and for scenarios like making sure an object
> > isn't
> > made available until it's completely initialized (i.e. such as for
> > rcu
> > protected list traversals, etc.).
> > 
> > With regard to ordering, we no longer access if_data in rcuwalk mode
> > with this change. Thus I think all we need here is the
> > WRITE_ONCE(i_link, NULL) that pairs with the READ_ONCE() in the vfs,
> > and
> > that happens earlier in xfs_inactive_symlink() before we rcu free the
> > memory here. With that, ISTM a racing lookup should either see an rcu
> > protected i_link or NULL, the latter of which calls into ->get_link()
> > and triggers refwalk mode. Hm?
> > 
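
Condensing the ordering described above into one place (this is the rfc
patch below, annotated; no new code here):

        /* xfs_inactive_symlink(): unpublish first, so a racing
         * rcu-walk that READ_ONCE()s ->i_link sees NULL and falls
         * back to ->get_link(), which returns -ECHILD without a
         * dentry */
        WRITE_ONCE(VFS_I(ip)->i_link, NULL);

        /* xfs_ifree(), later: retire the buffer; kfree_rcu()
         * guarantees the string outlives any rcu-walk that loaded
         * the old pointer before it was unpublished */
        kfree_rcu(ip->i_df.if_u1.if_data);
        ip->i_df.if_u1.if_data = NULL;
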
> > > But again, as I asked up front, why do we even need to free this
> > > memory buffer here? It will be freed in xfs_inode_free_callback()
> > > after the current RCU grace period expires, so what do we gain by
> > > freeing it separately here?
> > > 
> 
> The thing that's been bugging me is not knowing if the VFS has
> finished using the link string.
> 
> The link itself can be removed while the link path could still
> be valid, and the VFS will then walk that path. So a single rcu
> grace period might not be sufficient, but stashing the pointer and
> rcu freeing it ensures the string won't go away while the walk is
> under way, without any grace period guessing. Tying the free to
> the current path walk via an rcu delayed free is a reliable way to
> do what's needed.
> 
> The only problem here is keeping that path string alive while it
> is being used. If there aren't any other known problems with the
> xfs inode re-use subsystem I don't see any worth in complicating
> it to cater for this special case.
> 
> Brian's patch is a variation on the original patch and is all
> that's really needed. IMHO going this way (whatever we end up
> with) is the sensible thing to do.
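
For reference, the reader side this relies on, paraphrasing the
fs/namei.c pick_link() of that era:

        /* the VFS holds the rcu read lock across the whole rcu-walk */
        link = READ_ONCE(inode->i_link);
        if (!link)
                link = inode->i_op->get_link(NULL, inode, &done);
        /* the link string is then traversed entirely under that lock,
         * so memory retired with kfree_rcu() cannot vanish mid-walk */
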

Why not simply change xfs_readlink to memcpy ip->i_df.if_u1.if_data into
the caller's link buffer?  We shouldn't be exposing internal XFS
metadata buffers to the VFS to scribble on without telling us; this gets
rid of the dual inode_operations for symlinks; and we're probably
breaking a locking rule somewhere by not taking any locks AFAICT.  That
seems a lot less complex than adding rcu freeing rules to understand how
to handle local format file forks.
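
A sketch of what that could look like (assuming the in-core local fork
stays NUL-terminated, as xfs_init_local_fork() arranges, and that the
caller passes a MAXPATHLEN-sized buffer as xfs_vn_get_link() would):

int
xfs_readlink(
        struct xfs_inode        *ip,
        char                    *link)
{
        int                     error = 0;

        xfs_ilock(ip, XFS_ILOCK_SHARED);
        if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
                /* copy the inline body; the VFS never sees if_data */
                if (XFS_IS_CORRUPT(ip->i_mount, !ip->i_df.if_u1.if_data))
                        error = -EFSCORRUPTED;
                else
                        memcpy(link, ip->i_df.if_u1.if_data,
                               ip->i_df.if_bytes + 1);
        } else {
                error = xfs_readlink_bmap_ilocked(ip, link);
        }
        xfs_iunlock(ip, XFS_ILOCK_SHARED);
        return error;
}
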

(I say this from the vantage point of online repair, which will try to
salvage damaged symlinks, for which we actually /do/ want to be able to
lock out readers and change the data fork after symlink creation... but
I was saving that for 2022 because I'm too overwhelmed to try to send
that again.)

--D

> 
> Ian
> > 
> > One prevented memory leak? ;)
> > 
> > It won't be freed in xfs_inode_free_callback() because we change the
> > data fork format type (and clear i_mode) in this path. Perhaps that
> > could use an audit, but that's a separate issue.
> > 
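
The path in question, paraphrased from that era's fs/xfs/xfs_icache.c:

        /* xfs_inode_free_callback() */
        switch (VFS_I(ip)->i_mode & S_IFMT) {
        case S_IFREG:
        case S_IFDIR:
        case S_IFLNK:
                xfs_idestroy_fork(&ip->i_df);
                break;
        }

xfs_ifree() zeroes i_mode and resets if_format, so none of these cases
match for a freed symlink and a still-attached if_data would indeed
never be freed here.
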
> > > >                 ip->i_df.if_bytes = 0;
> > > >         }
> > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > index a607d6aca5c4..e98d7f10ba7d 100644
> > > > --- a/fs/xfs/xfs_iops.c
> > > > +++ b/fs/xfs/xfs_iops.c
> > > > @@ -511,27 +511,6 @@ xfs_vn_get_link(
> > > >         return ERR_PTR(error);
> > > >  }
> > > >  
> > > > -STATIC const char *
> > > > -xfs_vn_get_link_inline(
> > > > -       struct dentry           *dentry,
> > > > -       struct inode            *inode,
> > > > -       struct delayed_call     *done)
> > > > -{
> > > > -       struct xfs_inode        *ip = XFS_I(inode);
> > > > -       char                    *link;
> > > > -
> > > > -       ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
> > > > -
> > > > -       /*
> > > > -        * The VFS crashes on a NULL pointer, so return -
> > > > EFSCORRUPTED if
> > > > -        * if_data is junk.
> > > > -        */
> > > > -       link = ip->i_df.if_u1.if_data;
> > > > -       if (XFS_IS_CORRUPT(ip->i_mount, !link))
> > > > -               return ERR_PTR(-EFSCORRUPTED);
> > > > -       return link;
> > > > -}
> > > > -
> > > >  static uint32_t
> > > >  xfs_stat_blksize(
> > > >         struct xfs_inode        *ip)
> > > > @@ -1250,14 +1229,6 @@ static const struct inode_operations
> > > > xfs_symlink_inode_operations = {
> > > >         .update_time            = xfs_vn_update_time,
> > > >  };
> > > >  
> > > > -static const struct inode_operations
> > > > xfs_inline_symlink_inode_operations = {
> > > > -       .get_link               = xfs_vn_get_link_inline,
> > > > -       .getattr                = xfs_vn_getattr,
> > > > -       .setattr                = xfs_vn_setattr,
> > > > -       .listxattr              = xfs_vn_listxattr,
> > > > -       .update_time            = xfs_vn_update_time,
> > > > -};
> > > > -
> > > >  /* Figure out if this file actually supports DAX. */
> > > >  static bool
> > > >  xfs_inode_supports_dax(
> > > > @@ -1409,9 +1380,8 @@ xfs_setup_iops(
> > > >                 break;
> > > >         case S_IFLNK:
> > > >                 if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> > > > -                       inode->i_op =
> > > > &xfs_inline_symlink_inode_operations;
> > > > -               else
> > > > -                       inode->i_op =
> > > > &xfs_symlink_inode_operations;
> > > > +                       inode->i_link = ip->i_df.if_u1.if_data;
> > > > +               inode->i_op = &xfs_symlink_inode_operations;
> > > 
> > > This still needs corruption checks - ip->i_df.if_u1.if_data can be
> > > null if there's some kind of inode corruption detected.
> > > 
> > 
> > It's fine for i_link to be NULL. We'd just fall into the get_link()
> > call
> > and have to handle it there like the current callback does.
> > 
> > However, this does need to restore some of the code removed from
> > xfs_vn_get_link() in commit 30ee052e12b9 ("xfs: optimize inline
> > symlinks") to handle the local format case. If if_data can be NULL
> > we'll
> > obviously need to handle it there anyways.
> > 
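
Restored, xfs_vn_get_link() might look roughly like this (a sketch;
it assumes xfs_readlink() regains its local-format handling, as in the
sketch under Darrick's memcpy suggestion above):

STATIC const char *
xfs_vn_get_link(
        struct dentry           *dentry,
        struct inode            *inode,
        struct delayed_call     *done)
{
        char                    *link;
        int                     error;

        /* i_link was NULL: don't service rcu-walk from here */
        if (!dentry)
                return ERR_PTR(-ECHILD);

        link = kmalloc(MAXPATHLEN + 1, GFP_KERNEL);
        if (!link)
                return ERR_PTR(-ENOMEM);

        /* handles both local and extent formats, and returns
         * -EFSCORRUPTED for a NULL local-format if_data */
        error = xfs_readlink(XFS_I(inode), link);
        if (error) {
                kfree(link);
                return ERR_PTR(error);
        }
        set_delayed_call(done, kfree_link, link);
        return link;
}
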
> > If there's no fundamental objection I'll address these issues, give
> > it
> > some proper testing and send a real patch..
> > 
> > Brian
> > 
> > > >                 break;
> > > >         default:
> > > >                 inode->i_op = &xfs_inode_operations;
> > > > diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> > > > index fc2c6a404647..20ec2f450c56 100644
> > > > --- a/fs/xfs/xfs_symlink.c
> > > > +++ b/fs/xfs/xfs_symlink.c
> > > > @@ -497,6 +497,7 @@ xfs_inactive_symlink(
> > > >          * do here in that case.
> > > >          */
> > > >         if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > > +               WRITE_ONCE(VFS_I(ip)->i_link, NULL);
> > > 
> > > Again, rcu_assign_pointer(), yes?
> > > 
> > > >                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > >                 return 0;
> > > >         }
> > > > 
> > > > 
> > > 
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > 
> 
>
Ian Kent Dec. 16, 2021, 2:45 a.m. UTC | #30
On Tue, 2021-12-14 at 21:06 -0800, Darrick J. Wong wrote:
> On Wed, Dec 15, 2021 at 11:54:11AM +0800, Ian Kent wrote:
> > On Wed, 2021-11-24 at 15:56 -0500, Brian Foster wrote:
> > > On Tue, Nov 23, 2021 at 10:26:57AM +1100, Dave Chinner wrote:
> > > > On Mon, Nov 22, 2021 at 02:27:59PM -0500, Brian Foster wrote:
> > > > > On Mon, Nov 22, 2021 at 11:08:51AM +1100, Dave Chinner wrote:
> > > > > > On Fri, Nov 19, 2021 at 02:44:21PM -0500, Brian Foster
> > > > > > wrote:
> > > > > > > In
> > > > > > > any event, another experiment I ran in light of the above
> > > > > > > results that
> > > > > > > might be similar was to put the inode queueing component
> > > > > > > of
> > > > > > > destroy_inode() behind an rcu callback. This reduces the
> > > > > > > single threaded
> > > > > > > perf hit from the previous approach by about 50%. So not
> > > > > > > entirely
> > > > > > > baseline performance, but it's back closer to baseline if
> > > > > > > I
> > > > > > > double the
> > > > > > > throttle threshold (and actually faster at 4x). Perhaps
> > > > > > > my
> > > > > > > crude
> > > > > > > prototype logic could be optimized further to not rely on
> > > > > > > percpu
> > > > > > > threshold changes to match the baseline.
> > > > > > > 
> > > > > > > My overall takeaway from these couple hacky experiments
> > > > > > > is
> > > > > > > that the
> > > > > > > unconditional synchronous rcu wait is indeed probably too
> > > > > > > heavy weight,
> > > > > > > as you point out. The polling or callback (or perhaps
> > > > > > > your
> > > > > > > separate
> > > > > > > queue) approach seems to be in the ballpark of viability,
> > > > > > > however,
> > > > > > > particularly when we consider the behavior of scaled or
> > > > > > > mixed
> > > > > > > workloads
> > > > > > > (since inactive queue processing seems to be size driven
> > > > > > > vs.
> > > > > > > latency
> > > > > > > driven).
> > > > > > > 
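
In outline, that experiment might look like this (hypothetical; it
assumes a spare rcu_head, here called i_destroy_rcu, in struct
xfs_inode):

static void
xfs_inodegc_queue_rcu(
        struct rcu_head         *head)
{
        struct xfs_inode        *ip = container_of(head,
                                        struct xfs_inode, i_destroy_rcu);

        xfs_inodegc_queue(ip);
}

static void
xfs_fs_destroy_inode(
        struct inode            *inode)
{
        /* ...existing teardown elided... */

        /* queue for inactivation only after a grace period, so an
         * inode reaching the reclaim/reuse path has already
         * satisfied the RCU requirement */
        call_rcu(&XFS_I(inode)->i_destroy_rcu, xfs_inodegc_queue_rcu);
}
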
> > > > > > > So I dunno.. if you consider the various severity and
> > > > > > > complexity
> > > > > > > tradeoffs, this certainly seems worth more consideration
> > > > > > > to
> > > > > > > me. I can
> > > > > > > think of other potentially interesting ways to experiment
> > > > > > > with
> > > > > > > optimizing the above or perhaps tweak queueing to better
> > > > > > > facilitate
> > > > > > > taking advantage of grace periods, but it's not worth
> > > > > > > going
> > > > > > > too far down
> > > > > > > that road if you're wedded to the "busy inodes" approach.
> > > > > > 
> > > > > > I'm not wedded to "busy inodes" but, as your experiments
> > > > > > are
> > > > > > indicating, trying to fold rcu grace periods into the
> > > > > > deferred
> > > > > > inactivation path isn't completely mitigating the impact of
> > > > > > having
> > > > > > to wait for a grace period for recycling of inodes.
> > > > > > 
> > > > > 
> > > > > What I'm seeing so far is that the impact seems to be limited
> > > > > to
> > > > > the
> > > > > single threaded workload and largely mitigated by an increase
> > > > > in
> > > > > the
> > > > > percpu throttle limit. IOW, it's not completely free right
> > > > > out of
> > > > > the
> > > > > gate, but the impact seems isolated and potentially mitigated
> > > > > by
> > > > > adjustment of the pipeline.
> > > > > 
> > > > > I realize the throttle is a percpu value, so that is what has
> > > > > me
> > > > > wondering about some potential for gains in efficiency to try
> > > > > and
> > > > > get
> > > > > more of that single-threaded performance back in other ways,
> > > > > or
> > > > > perhaps
> > > > > enhancements that might be more broadly beneficial to
> > > > > deferred
> > > > > inactivations in general (i.e. some form of adaptive
> > > > > throttling
> > > > > thresholds to balance percpu thresholds against a global
> > > > > threshold).
> > > > 
> > > > I ran experiments on queue depth early on. Once we go over a
> > > > few
> > > > tens of inodes we start to lose the "hot cache" effect and
> > > > performance starts to go backwards. By queue depths of
> > > > hundreds,
> > > > we've lost all the hot cache and nothing else gets that
> > > > performance
> > > > back because we can't avoid the latency of all the memory
> > > > writes
> > > > from cache eviction and the followup memory loads that result.
> > > > 
> > > 
> > > Admittedly my testing is simple/crude as I'm just exploring the
> > > potential viability of a concept, not fine tuning a workload,
> > > etc.
> > > That
> > > said, I'm curious to know what your tests for this look like
> > > because
> > > I
> > > suspect I'm running into different conditions. My tests
> > > frequently
> > > hit
> > > the percpu throttle threshold (256 inodes), which is beyond your
> > > ideal
> > > tens of inodes range (and probably more throttle limited than cpu
> > > cache
> > > limited).
> > > 
> > > > Making the per-cpu queues longer or shorter based on global
> > > > state
> > > > won't gain us anything. All it will do is slow down local
> > > > operations
> > > > that don't otherwise need slowing down....
> > > > 
> > > 
> > > This leaves out context. The increase in throttle threshold
> > > mitigates
> > > the delays I've introduced via the rcu callback. That happens to
> > > produce
> > > observable results comparable to my baseline test, but it's more
> > > of a
> > > measure of the impact of the delay than a direct proposal. If
> > > there's
> > > a
> > > more fine grained test worth running here (re: above), please
> > > describe
> > > it.
> > > 
> > > > > > I suspect a rethink on the inode recycling mechanism is
> > > > > > needed.
> > > > > > The
> > > > > > way it is currently implemented was a brute force solution
> > > > > > - it
> > > > > > is
> > > > > > simple and effective. However, we need more nuance in the
> > > > > > recycling
> > > > > > logic now.  That is, if we are recycling an inode that is
> > > > > > clean, has
> > > > > > nlink >=1 and has not been unlinked, it means the VFS
> > > > > > evicted
> > > > > > it too
> > > > > > soon and we are going to re-instantiate it as the identical
> > > > > > inode
> > > > > > that was evicted from cache.
> > > > > > 
> > > > > 
> > > > > Probably. How advantageous is inode memory reuse supposed to
> > > > > be
> > > > > in the
> > > > > first place? I've more considered it a necessary side effect
> > > > > of
> > > > > broader
> > > > > architecture (i.e. deferred reclaim, etc.) as opposed to a
> > > > > primary
> > > > > optimization.
> > > > 
> > > > Yes, it's an architectural feature resulting from the
> > > > filesystem
> > > > inode life cycle being different to the VFS inode life cycle.
> > > > This
> > > > was inherited from Irix - it had separate inactivation vs
> > > > reclaim
> > > > states and action steps for vnodes - inactivation occurred when
> > > > the
> > > > vnode refcount went to zero, reclaim occurred when the vnode
> > > > was to
> > > > be freed.
> > > > 
> > > > Architecturally, Linux doesn't have this two-step
> > > > infrastructure;
> > > > it
> > > > just has evict() that runs everything when the inode needs to
> > > > be
> > > > reclaimed. Hence we hide the two-phase reclaim architecture of
> > > > XFS
> > > > behind that, and so we always had this troublesome impedance
> > > > mismatch on Linux.
> > > > 
> > > 
> > > Ok, that was generally how I viewed it.
> > > 
> > > > Thinking a bit about this, maybe there is a more integrated way
> > > > to
> > > > handle this life cycle impedance mismatch by making the way we
> > > > interact with the linux inode cache to be more ...  Irix like.
> > > > Linux
> > > > does actually give us a a callback when the last reference to
> > > > an
> > > > inode goes away: .drop_inode()
> > > > 
> > > > i.e. Maybe we should look to triggering inactivation work from
> > > > ->drop_inode instead of ->destroy_inode and hence always
> > > > leaving
> > > > unreferenced, reclaimable inodes in the VFS cache on the LRU.
> > > > i.e.
> > > > rather than hiding the two-phase XFS inode inactivation+reclaim
> > > > algorithm from the VFS, move it up into the VFS.  If we prevent
> > > > inodes from being reclaimed from the LRU until they have
> > > > finished
> > > > inactivation and are clean (easy enough just by marking the
> > > > inode
> > > > as
> > > > dirty), that would allow us to immediately reclaim and free
> > > > inodes
> > > > in evict() context. Integration with __wait_on_freeing_inode()
> > > > would
> > > > likely solve the RCU reuse/recycling issues.
> > > > 
> > > 
> > > Hmm.. this is the point where we decide whether the inode remains
> > > cached, which is currently basically whether the inode has a link
> > > count
> > > or not. That makes me curious what (can) happens with an
> > > unlinked/inactivated inode on the lru. I'm not sure any other fs'
> > > do
> > > anything like that currently..?
> > > 
> > > > There's more to it than just this, but perhaps the longer term
> > > > direction should be closer integration with the Linux inode
> > > > cache
> > > > life cycle rather than trying to solve all these problems
> > > > underneath
> > > > the VFS cache whilst still trying to play by its rules...
> > > > 
> > > 
> > > Yeah. Caching logic details aside, I think that makes sense.
> > > 
> > > > > I still see reuse occur with deferred inactivation, we
> > > > > just end up cycling through the same set of inodes as they
> > > > > fall
> > > > > through
> > > > > the queue rather than recycling the same one over and over.
> > > > > I'm
> > > > > sort of
> > > > > wondering what the impact would be if we didn't do this at
> > > > > all
> > > > > (for the
> > > > > new allocation case at least).
> > > > 
> > > > We end up with a larger pool of free inodes in the finobt. This
> > > > is
> > > > basically what my "busy inode check" proposal is based on -
> > > > inodes
> > > > that we can't allocate without recycling just remain on the
> > > > finobt
> > > > for longer before they can be used. This would be awful if we
> > > > didn't
> > > > have the finobt to efficiently locate free inodes - the finobt
> > > > record iteration makes it pretty low overhead to scan inodes
> > > > here.
> > > > 
> > > 
> > > I get the idea. That last bit is what I'm skeptical about. The
> > > finobt
> > > is
> > > based on the premise that free inode lookup becomes a predictable
> > > tree
> > > lookup instead of the old searching algorithm on the inobt, which
> > > we
> > > still support and can be awful in its own right under worst case
> > > conditions. I agree that this would be bad on the inobt (which
> > > raises
> > > the question on how we'd provide these recycling correctness
> > > guarantees
> > > on !finobt fs'). What I'm more concerned about is whether this
> > > could
> > > make finobt enabled fs' (transiently) just as poor as the old
> > > algo
> > > under
> > > certain workloads/conditions.
> > > 
> > > I think there needs to be at least some high level description of
> > > the
> > > search algorithm before we can sufficiently reason about its
> > > behavior..
> > > 
> > > > > > So how much re-initialisation do we actually need for that
> > > > > > inode?
> > > > > > Almost everything in the inode is still valid; the problems
> > > > > > come
> > > > > > from inode_init_always() resetting the entire internal
> > > > > > inode
> > > > > > state
> > > > > > and XFS then having to set them up again.  The internal
> > > > > > state
> > > > > > is
> > > > > > already largely correct when we start recycling, and the
> > > > > > identity of
> > > > > > the recycled inode does not change when nlink >= 1. Hence
> > > > > > eliding
> > > > > > inode_init_always() would also go a long way to avoiding
> > > > > > the
> > > > > > need
> > > > > > for a RCU grace period to pass before we can make the inode
> > > > > > visible
> > > > > > to the VFS again.
> > > > > > 
> > > > > > If we can do that, then the only inodes that need a grace
> > > > > > period
> > > > > > before they can be recycled are unlinked inodes as they
> > > > > > change
> > > > > > identity when being recycled. That identity change
> > > > > > absolutely
> > > > > > requires a grace period to expire before the new
> > > > > > instantiation
> > > > > > can
> > > > > > be made visible.  Given the arbitrary delay that this can
> > > > > > introduce
> > > > > > for an inode allocation operation, it seems much better
> > > > > > suited
> > > > > > to
> > > > > > detecting busy inodes than waiting for a global OS state
> > > > > > change
> > > > > > to
> > > > > > occur...
> > > > > > 
> > > > > 
> > > > > Maybe..? The experiments I've been doing are aimed at
> > > > > simplicity
> > > > > and
> > > > > reducing the scope of the changes. Part of the reason for
> > > > > this is
> > > > > tbh
> > > > > I'm not totally convinced we really need to do anything more
> > > > > complex
> > > > > than preserve the inline symlink buffer one way or another
> > > > > (for
> > > > > example,
> > > > > see the rfc patch below for an alternative to the inline
> > > > > symlink
> > > > > rcuwalk
> > > > > disable patch). Maybe we should consider something like this
> > > > > anyways.
> > > > > 
> > > > > With busy inodes, we need to alter inode allocation to some
> > > > > degree to
> > > > > accommodate. We can have (tens of?) thousands of inodes under
> > > > > the
> > > > > grace
> > > > > period at any time based on current batching behavior, so
> > > > > it's
> > > > > not
> > > > > totally evident to me that we won't end up with some of the
> > > > > same
> > > > > fundamental issues to deal with here, just needing to be
> > > > > accommodated in
> > > > > the inode allocation algorithm rather than the teardown
> > > > > sequence.
> > > > 
> > > > Sure, but the purpose of the allocation selection
> > > > policy is to select the best inode to allocate for the current
> > > > context.  The cost of not being able to use an inode
> > > > immediately
> > > > needs to be factored into that allocation policy. i.e. if the
> > > > selected inode has an associated delay with it before it can be
> > > > reused and other free inodes don't, then we should not be
> > > > selecting
> > > > the inode with a delay associated with it.
> > > > 
> > > 
> > > We still have to find those "no delay" inodes. AFAICT the worst
> > > case
> > > conditions on the system I've been playing with can have
> > > something
> > > like
> > > 20k free && busy inodes. That could cover all or most of the
> > > finobt
> > > at
> > > the time of an inode allocation. What happens from there depends
> > > on
> > > the
> > > search algorithm.
> > > 
> > > > This is exactly the reasoning and logic we use for busy
> > > > extents. 
> > > > We
> > > > only take the blocking penalty for resolving busy extent state
> > > > if
> > > > we
> > > > run out of free extents to search before we've found an
> > > > allocation
> > > > candidate. I think it makes sense for inode allocation, too.
> > > > 
> > > 
> > > Sure, the idea makes sense and it's worth looking into. But there
> > > are
> > > enough contextual differences that I wouldn't just assume the
> > > same
> > > logic
> > > translates over to the finobt without potential for performance
> > > impact.
> > > For example, extent allocation has some advantages with things
> > > like
> > > delalloc (physical block allocation occurs async from buffered
> > > write
> > > syscall time) and the fact that metadata allocs can reuse busy
> > > blocks.
> > > The finobt only tracks existing chunks with free inodes, so it's
> > > easily
> > > possible to have conditions where the finobt is 100% (or
> > > majority)
> > > populated with busy inodes (whether it be one inode or several
> > > thousand).
> > > 
> > > This raises questions like at what point does search cost become
> > > a
> > > factor? At what point and with what frequency do we suffer the
> > > blocking
> > > penalty? Do we opt to allocate new chunks based on gp state?
> > > Something
> > > else? We don't need to answer these questions here (this thread
> > > is
> > > long
> > > enough :P). I'm just trying to say that it's one thing to
> > > consider
> > > the
> > > approach a viable option, but it isn't automatically preferable
> > > just
> > > because we use it for extents. Further details beyond "detect
> > > busy
> > > inodes" would be nice to objectively reason about.
> > > 
> > > > > --- 8< ---
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > index 64b9bf334806..058e3fc69ff7 100644
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -2644,7 +2644,7 @@ xfs_ifree(
> > > > >          * already been freed by xfs_attr_inactive.
> > > > >          */
> > > > >         if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > > > -               kmem_free(ip->i_df.if_u1.if_data);
> > > > > +               kfree_rcu(ip->i_df.if_u1.if_data);
> > > > >                 ip->i_df.if_u1.if_data = NULL;
> > > > 
> > > > That would need to be rcu_assign_pointer(ip-
> > > > >i_df.if_u1.if_data,
> > > > NULL) to put the correct memory barriers in place, right? Also,
> > > > I
> > > > think ip->i_df.if_u1.if_data needs to be set to NULL before
> > > > calling
> > > > kfree_rcu() so racing lookups will always see NULL before
> > > > the object is freed.
> > > > 
> > > 
> > > I think rcu_assign_pointer() is intended to be paired with the
> > > associated rcu deref and for scenarios like making sure an object
> > > isn't
> > > made available until it's completely initialized (i.e. such as
> > > for
> > > rcu
> > > protected list traversals, etc.).
> > > 
> > > With regard to ordering, we no longer access if_data in rcuwalk
> > > mode
> > > with this change. Thus I think all we need here is the
> > > WRITE_ONCE(i_link, NULL) that pairs with the READ_ONCE() in the
> > > vfs,
> > > and
> > > that happens earlier in xfs_inactive_symlink() before we rcu free
> > > the
> > > memory here. With that, ISTM a racing lookup should either see an
> > > rcu
> > > protected i_link or NULL, the latter of which calls into -
> > > >get_link()
> > > and triggers refwalk mode. Hm?
> > > 
> > > > But again, as I asked up front, why do we even need to free
> > > > this
> > > > memory buffer here? It will be freed in
> > > > xfs_inode_free_callback()
> > > > after the current RCU grace period expires, so what do we gain
> > > > by
> > > > freeing it separately here?
> > > > 
> > 
> > The thing that's been bugging me is not knowing if the VFS has
> > finished using the link string.
> > 
> > The link itself can be removed while the link path could still
> > be valid, and the VFS will then walk that path. So a single rcu
> > grace period might not be sufficient, but stashing the pointer
> > and rcu freeing it ensures the string won't go away while the
> > walk is under way, without any grace period guessing. Tying the
> > free to the current path walk via an rcu delayed free is a
> > reliable way to do what's needed.
> > 
> > The only problem here is keeping that path string alive while
> > it is being used. If there aren't any other known problems with
> > the xfs inode re-use subsystem I don't see any worth in
> > complicating it to cater for this special case.
> > 
> > Brian's patch is a variation on the original patch and is all
> > that's really needed. IMHO going this way (whatever we end up
> > with) is the sensible thing to do.
> 
> Why not simply change xfs_readlink to memcpy ip->i_df.if_u1.if_data
> into
> the caller's link buffer?  We shouldn't be exposing internal XFS
> metadata buffers to the VFS to scribble on without telling us; this
> gets
> rid of the dual inode_operations for symlinks; and we're probably
> breaking a locking rule somewhere by not taking any locks AFAICT. 
> That
> seems a lot less complex than adding rcu freeing rules to understand
> how
> to handle local format file forks.

Ahhhaa ... I didn't/don't understand what these inline symlinks
are.

But it seems they aren't too different from the usual symlinks
and treating them as such makes the problem go away.

I'm pretty sure I was missing the point of some of this discussion
too because I just didn't get it.

Thanks for helping me get what's going on here Darrick.

Ian
> 
> (I say this from the vantage point of online repair, which will try
> to
> salvage damaged symlinks, for which we actually /do/ want to be able
> to
> lock out readers and change the data fork after symlink creation...
> but
> I was saving that for 2022 because I'm too overwhelmed to try to send
> that again.)
> 
> --D
> 
> > 
> > Ian
> > > 
> > > One prevented memory leak? ;)
> > > 
> > > It won't be freed in xfs_inode_free_callback() because we change
> > > the
> > > data fork format type (and clear i_mode) in this path. Perhaps
> > > that
> > > could use an audit, but that's a separate issue.
> > > 
> > > > >                 ip->i_df.if_bytes = 0;
> > > > >         }
> > > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > > index a607d6aca5c4..e98d7f10ba7d 100644
> > > > > --- a/fs/xfs/xfs_iops.c
> > > > > +++ b/fs/xfs/xfs_iops.c
> > > > > @@ -511,27 +511,6 @@ xfs_vn_get_link(
> > > > >         return ERR_PTR(error);
> > > > >  }
> > > > >  
> > > > > -STATIC const char *
> > > > > -xfs_vn_get_link_inline(
> > > > > -       struct dentry           *dentry,
> > > > > -       struct inode            *inode,
> > > > > -       struct delayed_call     *done)
> > > > > -{
> > > > > -       struct xfs_inode        *ip = XFS_I(inode);
> > > > > -       char                    *link;
> > > > > -
> > > > > -       ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
> > > > > -
> > > > > -       /*
> > > > > -        * The VFS crashes on a NULL pointer, so return -
> > > > > EFSCORRUPTED if
> > > > > -        * if_data is junk.
> > > > > -        */
> > > > > -       link = ip->i_df.if_u1.if_data;
> > > > > -       if (XFS_IS_CORRUPT(ip->i_mount, !link))
> > > > > -               return ERR_PTR(-EFSCORRUPTED);
> > > > > -       return link;
> > > > > -}
> > > > > -
> > > > >  static uint32_t
> > > > >  xfs_stat_blksize(
> > > > >         struct xfs_inode        *ip)
> > > > > @@ -1250,14 +1229,6 @@ static const struct inode_operations
> > > > > xfs_symlink_inode_operations = {
> > > > >         .update_time            = xfs_vn_update_time,
> > > > >  };
> > > > >  
> > > > > -static const struct inode_operations
> > > > > xfs_inline_symlink_inode_operations = {
> > > > > -       .get_link               = xfs_vn_get_link_inline,
> > > > > -       .getattr                = xfs_vn_getattr,
> > > > > -       .setattr                = xfs_vn_setattr,
> > > > > -       .listxattr              = xfs_vn_listxattr,
> > > > > -       .update_time            = xfs_vn_update_time,
> > > > > -};
> > > > > -
> > > > >  /* Figure out if this file actually supports DAX. */
> > > > >  static bool
> > > > >  xfs_inode_supports_dax(
> > > > > @@ -1409,9 +1380,8 @@ xfs_setup_iops(
> > > > >                 break;
> > > > >         case S_IFLNK:
> > > > >                 if (ip->i_df.if_format ==
> > > > > XFS_DINODE_FMT_LOCAL)
> > > > > -                       inode->i_op =
> > > > > &xfs_inline_symlink_inode_operations;
> > > > > -               else
> > > > > -                       inode->i_op =
> > > > > &xfs_symlink_inode_operations;
> > > > > +                       inode->i_link = ip-
> > > > > >i_df.if_u1.if_data;
> > > > > +               inode->i_op = &xfs_symlink_inode_operations;
> > > > 
> > > > This still needs corruption checks - ip->i_df.if_u1.if_data can
> > > > be
> > > > null if there's some kind of inode corruption detected.
> > > > 
> > > 
> > > It's fine for i_link to be NULL. We'd just fall into the
> > > get_link()
> > > call
> > > and have to handle it there like the current callback does.
> > > 
> > > However, this does need to restore some of the code removed from
> > > xfs_vn_get_link() in commit 30ee052e12b9 ("xfs: optimize inline
> > > symlinks") to handle the local format case. If if_data can be
> > > NULL
> > > we'll
> > > obviously need to handle it there anyways.
> > > 
> > > If there's no fundamental objection I'll address these issues,
> > > give
> > > it
> > > some proper testing and send a real patch..
> > > 
> > > Brian
> > > 
> > > > >                 break;
> > > > >         default:
> > > > >                 inode->i_op = &xfs_inode_operations;
> > > > > diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> > > > > index fc2c6a404647..20ec2f450c56 100644
> > > > > --- a/fs/xfs/xfs_symlink.c
> > > > > +++ b/fs/xfs/xfs_symlink.c
> > > > > @@ -497,6 +497,7 @@ xfs_inactive_symlink(
> > > > >          * do here in that case.
> > > > >          */
> > > > >         if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > > > +               WRITE_ONCE(VFS_I(ip)->i_link, NULL);
> > > > 
> > > > Again, rcu_assign_pointer(), yes?
> > > > 
> > > > >                 xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > >                 return 0;
> > > > >         }
> > > > > 
> > > > > 
> > > > 
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
> > > > 
> > > 
> > 
> >

Patch

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 54da6d717a06..c1bd1103b340 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -61,6 +61,10 @@  static inline void  kmem_free(const void *ptr)
 {
 	kvfree(ptr);
 }
+static inline void  kmem_free_rcu(const void *ptr)
+{
+	kvfree_rcu(ptr);
+}
 
 
 static inline void *
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index a4f6f034fb81..aaa1911e61ed 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2650,8 +2650,8 @@  xfs_ifree(
 	 * already been freed by xfs_attr_inactive.
 	 */
 	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
-		kmem_free(ip->i_df.if_u1.if_data);
-		ip->i_df.if_u1.if_data = NULL;
+		kmem_free_rcu(ip->i_df.if_u1.if_data);
+		RCU_INIT_POINTER(ip->i_df.if_u1.if_data, NULL);
 		ip->i_df.if_bytes = 0;
 	}
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a607d6aca5c4..2977e19da7b7 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -524,11 +524,17 @@  xfs_vn_get_link_inline(
 
 	/*
 	 * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED if
-	 * if_data is junk.
+	 * if_data is junk. Also, if the path walk is in rcu-walk mode
+	 * and the inode link path has gone away due to inode re-use, we
+	 * have no choice but to tell the VFS to redo the lookup.
 	 */
-	link = ip->i_df.if_u1.if_data;
+	link = rcu_dereference(ip->i_df.if_u1.if_data);
+	if (!dentry && !link)
+		return ERR_PTR(-ECHILD);
+
 	if (XFS_IS_CORRUPT(ip->i_mount, !link))
 		return ERR_PTR(-EFSCORRUPTED);
+
 	return link;
 }
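
Taken together, the two sides of this patch pair up as follows:
xfs_ifree() retires the inline buffer with kvfree_rcu() and publishes
NULL via RCU_INIT_POINTER(), while xfs_vn_get_link_inline() reads it
back with rcu_dereference(). An rcu-walk caller (the dentry == NULL
case, per the ->get_link() convention) that observes NULL gets -ECHILD,
and the VFS restarts the lookup in ref-walk mode, roughly (paraphrasing
the fs/namei.c pick_link() of that era):

        if (nd->flags & LOOKUP_RCU) {
                res = get(NULL, inode, &last->done);
                if (res == ERR_PTR(-ECHILD) && try_to_unlazy(nd))
                        res = get(link->dentry, inode, &last->done);
        }
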