vfs: check dentry is still valid in get_link()

Message ID	164180589176.86426.501271559065590169.stgit@mickey.themaw.net (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> Subject: [PATCH] vfs: check dentry is still valid in get_link() From: Ian Kent <raven@themaw.net> To: Al Viro <viro@ZenIV.linux.org.uk> Cc: Brian Foster <bfoster@redhat.com>, "Darrick J. Wong" <djwong@kernel.org>, Christoph Hellwig <hch@lst.de>, Miklos Szeredi <miklos@szeredi.hu>, David Howells <dhowells@redhat.com>, Kernel Mailing List <linux-kernel@vger.kernel.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, xfs <linux-xfs@vger.kernel.org> Date: Mon, 10 Jan 2022 17:11:31 +0800 Message-ID: <164180589176.86426.501271559065590169.stgit@mickey.themaw.net> User-Agent: StGit/0.23 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit Precedence: bulk
Series	vfs: check dentry is still valid in get_link() \| expand vfs: check dentry is still valid in get_link()

Ian Kent Jan. 10, 2022, 9:11 a.m. UTC

When following a trailing symlink in rcu-walk mode it's possible for
the dentry to become invalid between the last dentry seq lock check
and getting the link (eg. an unlink) leading to a backtrace similar
to this:

crash> bt
PID: 10964  TASK: ffff951c8aa92f80  CPU: 3   COMMAND: "TaniumCX"
…
 #7 [ffffae44d0a6fbe0] page_fault at ffffffff8d6010fe
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: ffffae44d0a6fc90  RFLAGS: 00010246
    RAX: ffffffff8da3cc80  RBX: ffffae44d0a6fd30  RCX: 0000000000000000
    RDX: ffffae44d0a6fd98  RSI: ffff951aa9af3008  RDI: 0000000000000000
    RBP: 0000000000000000   R8: ffffae44d0a6fb94   R9: 0000000000000000
    R10: ffff951c95d8c318  R11: 0000000000080000  R12: ffffae44d0a6fd98
    R13: ffff951aa9af3008  R14: ffff951c8c9eb840  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffffae44d0a6fc90] trailing_symlink at ffffffff8cf24e61
 #9 [ffffae44d0a6fcc8] path_lookupat at ffffffff8cf261d1
#10 [ffffae44d0a6fd28] filename_lookup at ffffffff8cf2a700
#11 [ffffae44d0a6fe40] vfs_statx at ffffffff8cf1dbc4
#12 [ffffae44d0a6fe98] __do_sys_newstat at ffffffff8cf1e1f9
#13 [ffffae44d0a6ff38] do_syscall_64 at ffffffff8cc0420b

Most of the time this is not a problem because the inode is unchanged
while the rcu read lock is held.

But xfs can re-use inodes which can result in the inode ->get_link()
method becoming invalid (or NULL).

This case needs to be checked for in fs/namei.c:get_link() and if
detected the walk re-started.

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/namei.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Al Viro Jan. 15, 2022, 6:38 a.m. UTC | #1

On Mon, Jan 10, 2022 at 05:11:31PM +0800, Ian Kent wrote:
> When following a trailing symlink in rcu-walk mode it's possible for
> the dentry to become invalid between the last dentry seq lock check
> and getting the link (eg. an unlink) leading to a backtrace similar
> to this:
> 
> crash> bt
> PID: 10964  TASK: ffff951c8aa92f80  CPU: 3   COMMAND: "TaniumCX"
> …
>  #7 [ffffae44d0a6fbe0] page_fault at ffffffff8d6010fe
>     [exception RIP: unknown or invalid address]
>     RIP: 0000000000000000  RSP: ffffae44d0a6fc90  RFLAGS: 00010246
>     RAX: ffffffff8da3cc80  RBX: ffffae44d0a6fd30  RCX: 0000000000000000
>     RDX: ffffae44d0a6fd98  RSI: ffff951aa9af3008  RDI: 0000000000000000
>     RBP: 0000000000000000   R8: ffffae44d0a6fb94   R9: 0000000000000000
>     R10: ffff951c95d8c318  R11: 0000000000080000  R12: ffffae44d0a6fd98
>     R13: ffff951aa9af3008  R14: ffff951c8c9eb840  R15: 0000000000000000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #8 [ffffae44d0a6fc90] trailing_symlink at ffffffff8cf24e61
>  #9 [ffffae44d0a6fcc8] path_lookupat at ffffffff8cf261d1
> #10 [ffffae44d0a6fd28] filename_lookup at ffffffff8cf2a700
> #11 [ffffae44d0a6fe40] vfs_statx at ffffffff8cf1dbc4
> #12 [ffffae44d0a6fe98] __do_sys_newstat at ffffffff8cf1e1f9
> #13 [ffffae44d0a6ff38] do_syscall_64 at ffffffff8cc0420b
> 
> Most of the time this is not a problem because the inode is unchanged
> while the rcu read lock is held.
> 
> But xfs can re-use inodes which can result in the inode ->get_link()
> method becoming invalid (or NULL).

Without an RCU delay?  Then we have much worse problems...

Details, please.

Ian Kent Jan. 17, 2022, 2:55 a.m. UTC | #2

On Sat, 2022-01-15 at 06:38 +0000, Al Viro wrote:
> On Mon, Jan 10, 2022 at 05:11:31PM +0800, Ian Kent wrote:
> > When following a trailing symlink in rcu-walk mode it's possible
> > for
> > the dentry to become invalid between the last dentry seq lock check
> > and getting the link (eg. an unlink) leading to a backtrace similar
> > to this:
> > 
> > crash> bt
> > PID: 10964  TASK: ffff951c8aa92f80  CPU: 3   COMMAND: "TaniumCX"
> > …
> >  #7 [ffffae44d0a6fbe0] page_fault at ffffffff8d6010fe
> >     [exception RIP: unknown or invalid address]
> >     RIP: 0000000000000000  RSP: ffffae44d0a6fc90  RFLAGS: 00010246
> >     RAX: ffffffff8da3cc80  RBX: ffffae44d0a6fd30  RCX:
> > 0000000000000000
> >     RDX: ffffae44d0a6fd98  RSI: ffff951aa9af3008  RDI:
> > 0000000000000000
> >     RBP: 0000000000000000   R8: ffffae44d0a6fb94   R9:
> > 0000000000000000
> >     R10: ffff951c95d8c318  R11: 0000000000080000  R12:
> > ffffae44d0a6fd98
> >     R13: ffff951aa9af3008  R14: ffff951c8c9eb840  R15:
> > 0000000000000000
> >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> >  #8 [ffffae44d0a6fc90] trailing_symlink at ffffffff8cf24e61
> >  #9 [ffffae44d0a6fcc8] path_lookupat at ffffffff8cf261d1
> > #10 [ffffae44d0a6fd28] filename_lookup at ffffffff8cf2a700
> > #11 [ffffae44d0a6fe40] vfs_statx at ffffffff8cf1dbc4
> > #12 [ffffae44d0a6fe98] __do_sys_newstat at ffffffff8cf1e1f9
> > #13 [ffffae44d0a6ff38] do_syscall_64 at ffffffff8cc0420b
> > 
> > Most of the time this is not a problem because the inode is
> > unchanged
> > while the rcu read lock is held.
> > 
> > But xfs can re-use inodes which can result in the inode -
> > >get_link()
> > method becoming invalid (or NULL).
> 
> Without an RCU delay?  Then we have much worse problems...

Sorry for the delay.

That was a problem that was discussed at length with the original post
of this patch that included a patch for this too (misguided though it
was).

That discussion resulted in Darrick merging the problem xfs inline
symlink handling with the xfs normal symlink handling.

Another problem with these inline syslinks was they would hand a
pointer to internal xfs storage to the VFS. Darrick's change
allocates and copies the link then hands it to the VFS to free
after use. And since there's an allocation in the symlink handler
the rcu-walk case returns -ECHILD (on passed NULL dentry) so the
VFS will call unlazy before that next call which I think is itself
enough to resolve this problem.

The only thing I think might be questionable is the VFS copy of the
inode pointer but I think the inode is rcu freed so it will be
around and the seq count will have changed so I think it should be
ok.

If I'm missing something please say so, ;)

Darrick's patch is (was last I looked) in his xfs-next tree.

Ian

Brian Foster Jan. 17, 2022, 2:35 p.m. UTC | #3

On Mon, Jan 17, 2022 at 10:55:32AM +0800, Ian Kent wrote:
> On Sat, 2022-01-15 at 06:38 +0000, Al Viro wrote:
> > On Mon, Jan 10, 2022 at 05:11:31PM +0800, Ian Kent wrote:
> > > When following a trailing symlink in rcu-walk mode it's possible
> > > for
> > > the dentry to become invalid between the last dentry seq lock check
> > > and getting the link (eg. an unlink) leading to a backtrace similar
> > > to this:
> > > 
> > > crash> bt
> > > PID: 10964  TASK: ffff951c8aa92f80  CPU: 3   COMMAND: "TaniumCX"
> > > …
> > >  #7 [ffffae44d0a6fbe0] page_fault at ffffffff8d6010fe
> > >     [exception RIP: unknown or invalid address]
> > >     RIP: 0000000000000000  RSP: ffffae44d0a6fc90  RFLAGS: 00010246
> > >     RAX: ffffffff8da3cc80  RBX: ffffae44d0a6fd30  RCX:
> > > 0000000000000000
> > >     RDX: ffffae44d0a6fd98  RSI: ffff951aa9af3008  RDI:
> > > 0000000000000000
> > >     RBP: 0000000000000000   R8: ffffae44d0a6fb94   R9:
> > > 0000000000000000
> > >     R10: ffff951c95d8c318  R11: 0000000000080000  R12:
> > > ffffae44d0a6fd98
> > >     R13: ffff951aa9af3008  R14: ffff951c8c9eb840  R15:
> > > 0000000000000000
> > >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > >  #8 [ffffae44d0a6fc90] trailing_symlink at ffffffff8cf24e61
> > >  #9 [ffffae44d0a6fcc8] path_lookupat at ffffffff8cf261d1
> > > #10 [ffffae44d0a6fd28] filename_lookup at ffffffff8cf2a700
> > > #11 [ffffae44d0a6fe40] vfs_statx at ffffffff8cf1dbc4
> > > #12 [ffffae44d0a6fe98] __do_sys_newstat at ffffffff8cf1e1f9
> > > #13 [ffffae44d0a6ff38] do_syscall_64 at ffffffff8cc0420b
> > > 
> > > Most of the time this is not a problem because the inode is
> > > unchanged
> > > while the rcu read lock is held.
> > > 
> > > But xfs can re-use inodes which can result in the inode -
> > > >get_link()
> > > method becoming invalid (or NULL).
> > 
> > Without an RCU delay?  Then we have much worse problems...
> 
> Sorry for the delay.
> 
> That was a problem that was discussed at length with the original post
> of this patch that included a patch for this too (misguided though it
> was).
> 

To Al's question, at the end of the day there is no rcu delay involved
with inode reuse in XFS. We do use call_rcu() for eventual freeing of
inodes (see __xfs_inode_free()), but inode reuse occurs for inodes that
have been put into a "reclaim" state before getting to the point of
freeing the struct inode memory. This lead to the long discussion [1]
Ian references around ways to potentially deal with that. I think the
TLDR of that thread is there are various potential options for
improvement, such as to rcu wait on inode creation/reuse (either
explicitly or via more open coded grace period cookie tracking), to rcu
wait somewhere in the destroy sequence before inodes become reuse
candidates, etc., but none of them seemingly agreeable for varying
reasons (IIRC mostly stemming from either performance or compexity) [2].

The change that has been made so far in XFS is to turn rcuwalk for
symlinks off once again, which looks like landed in Linus' tree as
commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata
buffers to the vfs"). The hope is that between that patch and this
prospective vfs tweak, we can have a couple incremental fixes that at
least address the practical problem users have been running into (which
is a crash due to a NULL ->get_link() callback pointer due to inode
reuse). The inode reuse vs. rcu thing might still be a broader problem,
but AFAIA that mechanism has been in place in XFS on Linux pretty much
forever.

Brian

[1] https://lore.kernel.org/linux-fsdevel/163660197073.22525.11235124150551283676.stgit@mickey.themaw.net/

[2] Yet another idea could be a mix of two of the previously discussed
approaches: stamp the current rcu gp marker in the xfs_inode somewhere
on destroy and check it on reuse to conditionally rcu wait when
necessary. Perhaps that might provide enough batching to mitigate
performance impact when compared to an unconditional create side wait.

> That discussion resulted in Darrick merging the problem xfs inline
> symlink handling with the xfs normal symlink handling.
> 
> Another problem with these inline syslinks was they would hand a
> pointer to internal xfs storage to the VFS. Darrick's change
> allocates and copies the link then hands it to the VFS to free
> after use. And since there's an allocation in the symlink handler
> the rcu-walk case returns -ECHILD (on passed NULL dentry) so the
> VFS will call unlazy before that next call which I think is itself
> enough to resolve this problem.
> 
> The only thing I think might be questionable is the VFS copy of the
> inode pointer but I think the inode is rcu freed so it will be
> around and the seq count will have changed so I think it should be
> ok.
> 
> If I'm missing something please say so, ;)
> 
> Darrick's patch is (was last I looked) in his xfs-next tree.
> 
> Ian
> 
>

Al Viro Jan. 17, 2022, 4:28 p.m. UTC | #4

On Mon, Jan 17, 2022 at 09:35:58AM -0500, Brian Foster wrote:

> To Al's question, at the end of the day there is no rcu delay involved
> with inode reuse in XFS. We do use call_rcu() for eventual freeing of
> inodes (see __xfs_inode_free()), but inode reuse occurs for inodes that
> have been put into a "reclaim" state before getting to the point of
> freeing the struct inode memory. This lead to the long discussion [1]
> Ian references around ways to potentially deal with that. I think the
> TLDR of that thread is there are various potential options for
> improvement, such as to rcu wait on inode creation/reuse (either
> explicitly or via more open coded grace period cookie tracking), to rcu
> wait somewhere in the destroy sequence before inodes become reuse
> candidates, etc., but none of them seemingly agreeable for varying
> reasons (IIRC mostly stemming from either performance or compexity) [2].
> 
> The change that has been made so far in XFS is to turn rcuwalk for
> symlinks off once again, which looks like landed in Linus' tree as
> commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata
> buffers to the vfs"). The hope is that between that patch and this
> prospective vfs tweak, we can have a couple incremental fixes that at
> least address the practical problem users have been running into (which
> is a crash due to a NULL ->get_link() callback pointer due to inode
> reuse). The inode reuse vs. rcu thing might still be a broader problem,
> but AFAIA that mechanism has been in place in XFS on Linux pretty much
> forever.

My problem with that is that pathname resolution very much relies upon
the assumption that any inode it observes will *not* change its nature
until the final rcu_read_unlock().  Papering over ->i_op->get_link reads
in symlink case might be sufficient at the moment (I'm still not certain
about that, though), but that's rather brittle.  E.g. if some XFS change
down the road adds ->permission() on some inodes, you'll get the same
problem in do_inode_permission().  We also have places where we rely upon
	sample ->d_seq
	fetch ->d_flags
	fetch ->d_inode
	validate ->d_seq
	...
	assume that inode type matches the information in flags

How painful would it be to make xfs_destroy_inode() a ->free_inode() instance?
IOW, how far is xfs_inode_mark_reclaimable() from being callable in RCU
callback context?  Note that ->destroy_inode() is called via

static void destroy_inode(struct inode *inode)
{
	const struct super_operations *ops = inode->i_sb->s_op;

	BUG_ON(!list_empty(&inode->i_lru));
	__destroy_inode(inode);
	if (ops->destroy_inode) {
		ops->destroy_inode(inode);
		if (!ops->free_inode)
			return;
	}
	inode->free_inode = ops->free_inode;
	call_rcu(&inode->i_rcu, i_callback);
}

with

static void i_callback(struct rcu_head *head)
{
        struct inode *inode = container_of(head, struct inode, i_rcu);
	if (inode->free_inode)
		inode->free_inode(inode);
	else   
		free_inode_nonrcu(inode);
}

IOW, ->free_inode() is RCU-delayed part of ->destroy_inode().  If both
are present, ->destroy_inode() will be called synchronously, followed
by ->free_inode() from RCU callback, so you can have both - moving just
the "finally mark for reuse" part into ->free_inode() would be OK.
Any blocking stuff (if any) can be left in ->destroy_inode()...

Al Viro Jan. 17, 2022, 6:10 p.m. UTC | #5

On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote:

> IOW, ->free_inode() is RCU-delayed part of ->destroy_inode().  If both
> are present, ->destroy_inode() will be called synchronously, followed
> by ->free_inode() from RCU callback, so you can have both - moving just
> the "finally mark for reuse" part into ->free_inode() would be OK.
> Any blocking stuff (if any) can be left in ->destroy_inode()...

BTW, we *do* have a problem with ext4 fast symlinks.  Pathwalk assumes that
strings it parses are not changing under it.  There are rather delicate
dances in dcache lookups re possibility of ->d_name contents changing under
it, but the search key is assumed to be stable.

What's more, there's a correctness issue even if we do not oops.  Currently
we do not recheck ->d_seq of symlink dentry when we dismiss the symlink from
the stack.  After all, we'd just finished traversing what used to be the
contents of a symlink that used to be in the right place.  It might have been
unlinked while we'd been traversing it, but that's not a correctness issue.

But that critically depends upon the contents not getting mangled.  If it
*can* be screwed by such unlink, we risk successful lookup leading to the
wrong place, with nothing to tell us that it's happening.  We could handle
that by adding a check to fs/namei.c:put_link(), and propagating the error
to callers.  It's not impossible, but it won't be pretty.

And that assumes we avoid oopsen on string changing under us in the first
place.  Which might or might not be true - I hadn't finished the audit yet.
Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs methods -
we need to make sure that e.g. everything called by ->d_hash() instances
is OK with strings changing right under them.  Including utf8_to_utf32(),
crc32_le(), utf8_casefold_hash(), etc.

Brian Foster Jan. 17, 2022, 6:42 p.m. UTC | #6

On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote:
> On Mon, Jan 17, 2022 at 09:35:58AM -0500, Brian Foster wrote:
> 
> > To Al's question, at the end of the day there is no rcu delay involved
> > with inode reuse in XFS. We do use call_rcu() for eventual freeing of
> > inodes (see __xfs_inode_free()), but inode reuse occurs for inodes that
> > have been put into a "reclaim" state before getting to the point of
> > freeing the struct inode memory. This lead to the long discussion [1]
> > Ian references around ways to potentially deal with that. I think the
> > TLDR of that thread is there are various potential options for
> > improvement, such as to rcu wait on inode creation/reuse (either
> > explicitly or via more open coded grace period cookie tracking), to rcu
> > wait somewhere in the destroy sequence before inodes become reuse
> > candidates, etc., but none of them seemingly agreeable for varying
> > reasons (IIRC mostly stemming from either performance or compexity) [2].
> > 
> > The change that has been made so far in XFS is to turn rcuwalk for
> > symlinks off once again, which looks like landed in Linus' tree as
> > commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata
> > buffers to the vfs"). The hope is that between that patch and this
> > prospective vfs tweak, we can have a couple incremental fixes that at
> > least address the practical problem users have been running into (which
> > is a crash due to a NULL ->get_link() callback pointer due to inode
> > reuse). The inode reuse vs. rcu thing might still be a broader problem,
> > but AFAIA that mechanism has been in place in XFS on Linux pretty much
> > forever.
> 
> My problem with that is that pathname resolution very much relies upon
> the assumption that any inode it observes will *not* change its nature
> until the final rcu_read_unlock().  Papering over ->i_op->get_link reads
> in symlink case might be sufficient at the moment (I'm still not certain
> about that, though), but that's rather brittle.  E.g. if some XFS change
> down the road adds ->permission() on some inodes, you'll get the same
> problem in do_inode_permission().  We also have places where we rely upon
> 	sample ->d_seq
> 	fetch ->d_flags
> 	fetch ->d_inode
> 	validate ->d_seq
> 	...
> 	assume that inode type matches the information in flags
> 
> How painful would it be to make xfs_destroy_inode() a ->free_inode() instance?
> IOW, how far is xfs_inode_mark_reclaimable() from being callable in RCU
> callback context?  Note that ->destroy_inode() is called via
> 

As discussed on IRC, this was brought up in the earlier discussion by
Miklos. Dave expressed some concern around locking, but I'm not sure I
grok the details from reading back [1]. The implication seems to be the
lookup side would have to rcu wait on associated inodes in the destroy
side, which might be more of a concern about unconditional use of
free_inode() as opposed to more selective rcu waiting for unlinked (or
inactive) inodes. Dave would need to chime in further on that..

As it is, it looks to me that unlinked inodes unconditionally go to the
inactive queues and thus the create side (xfs_iget_cache_hit(), if we're
considering inode reuse) basically pushes on the queue and waits on the
inode state to clear. Given that, ISTM it shouldn't be that much
functional pain to introduce an rcu delay somewhere before an inactive
inode becomes reclaimable (and thus reusable).

I think the impediment to something like this has been more performance
related. An inode alloc/free workload can turn into a continuous reuse
of the same batch of inodes, over and over. Therefore an rcu wait on
iget reuse can become effectively unconditional and slow things down
quite a bit (hence my previous, untested thought around making it
conditional and potentially amortizing the cost). I had played with a
more selective grace period in the teardown side for inactive inodes via
queue_rcu_work(), since that's an easy place to inject an rcu
delay/callback, but that had some performance impact on sustained file
removal that might require retuning other bits..

Brian

[1] https://lore.kernel.org/linux-fsdevel/20211114231834.GM449541@dread.disaster.area/#t

> static void destroy_inode(struct inode *inode)
> {
> 	const struct super_operations *ops = inode->i_sb->s_op;
> 
> 	BUG_ON(!list_empty(&inode->i_lru));
> 	__destroy_inode(inode);
> 	if (ops->destroy_inode) {
> 		ops->destroy_inode(inode);
> 		if (!ops->free_inode)
> 			return;
> 	}
> 	inode->free_inode = ops->free_inode;
> 	call_rcu(&inode->i_rcu, i_callback);
> }
> 
> with
> 
> static void i_callback(struct rcu_head *head)
> {
>         struct inode *inode = container_of(head, struct inode, i_rcu);
> 	if (inode->free_inode)
> 		inode->free_inode(inode);
> 	else   
> 		free_inode_nonrcu(inode);
> }
> 
> IOW, ->free_inode() is RCU-delayed part of ->destroy_inode().  If both
> are present, ->destroy_inode() will be called synchronously, followed
> by ->free_inode() from RCU callback, so you can have both - moving just
> the "finally mark for reuse" part into ->free_inode() would be OK.
> Any blocking stuff (if any) can be left in ->destroy_inode()...
>

Al Viro Jan. 17, 2022, 7:48 p.m. UTC | #7

On Mon, Jan 17, 2022 at 06:10:36PM +0000, Al Viro wrote:
> On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote:
> 
> > IOW, ->free_inode() is RCU-delayed part of ->destroy_inode().  If both
> > are present, ->destroy_inode() will be called synchronously, followed
> > by ->free_inode() from RCU callback, so you can have both - moving just
> > the "finally mark for reuse" part into ->free_inode() would be OK.
> > Any blocking stuff (if any) can be left in ->destroy_inode()...
> 
> BTW, we *do* have a problem with ext4 fast symlinks.  Pathwalk assumes that
> strings it parses are not changing under it.  There are rather delicate
> dances in dcache lookups re possibility of ->d_name contents changing under
> it, but the search key is assumed to be stable.
> 
> What's more, there's a correctness issue even if we do not oops.  Currently
> we do not recheck ->d_seq of symlink dentry when we dismiss the symlink from
> the stack.  After all, we'd just finished traversing what used to be the
> contents of a symlink that used to be in the right place.  It might have been
> unlinked while we'd been traversing it, but that's not a correctness issue.
> 
> But that critically depends upon the contents not getting mangled.  If it
> *can* be screwed by such unlink, we risk successful lookup leading to the
> wrong place, with nothing to tell us that it's happening.  We could handle
> that by adding a check to fs/namei.c:put_link(), and propagating the error
> to callers.  It's not impossible, but it won't be pretty.
> 
> And that assumes we avoid oopsen on string changing under us in the first
> place.  Which might or might not be true - I hadn't finished the audit yet.
> Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs methods -
> we need to make sure that e.g. everything called by ->d_hash() instances
> is OK with strings changing right under them.  Including utf8_to_utf32(),
> crc32_le(), utf8_casefold_hash(), etc.

And AFAICS, ext4, xfs and possibly ubifs (I'm unfamiliar with that one and
the call chains there are deep enough for me to miss something) have the
"bugger the contents of string returned by RCU ->get_link() if unlink()
happens" problem.

I would very much prefer to have them deal with that crap, especially
since I don't see why does ext4_evict_inode() need to do that memset() -
can't we simply check ->i_op in ext4_can_truncate() and be done with
that?

Al Viro Jan. 18, 2022, 1:32 a.m. UTC | #8

On Mon, Jan 17, 2022 at 07:48:49PM +0000, Al Viro wrote:
> > But that critically depends upon the contents not getting mangled.  If it
> > *can* be screwed by such unlink, we risk successful lookup leading to the
> > wrong place, with nothing to tell us that it's happening.  We could handle
> > that by adding a check to fs/namei.c:put_link(), and propagating the error
> > to callers.  It's not impossible, but it won't be pretty.
> > 
> > And that assumes we avoid oopsen on string changing under us in the first
> > place.  Which might or might not be true - I hadn't finished the audit yet.
> > Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs methods -
> > we need to make sure that e.g. everything called by ->d_hash() instances
> > is OK with strings changing right under them.  Including utf8_to_utf32(),
> > crc32_le(), utf8_casefold_hash(), etc.
> 
> And AFAICS, ext4, xfs and possibly ubifs (I'm unfamiliar with that one and
> the call chains there are deep enough for me to miss something) have the
> "bugger the contents of string returned by RCU ->get_link() if unlink()
> happens" problem.
> 
> I would very much prefer to have them deal with that crap, especially
> since I don't see why does ext4_evict_inode() need to do that memset() -
> can't we simply check ->i_op in ext4_can_truncate() and be done with
> that?

This reuse-without-delay has another fun side, AFAICS.  Suppose the new use
for inode comes with the same ->i_op (i.e. it's a symlink again) and it
happens right after ->get_link() has returned the pointer to body.

We are already past whatever checks we might add in pick_link().  And the
pointer is still valid.  So we end up quietly traversing the body of
completely unrelated symlink that never had been anywhere near any directory
we might be looking at.  With no indication of anything going wrong - just
a successful resolution with bogus result.

Could XFS folks explain what exactly goes wrong if we make actual marking
inode as ready for reuse RCU-delayed, by shifting just that into
->free_inode()?  Why would we need any extra synchronize_rcu() anywhere?

Ian Kent Jan. 18, 2022, 2:31 a.m. UTC | #9

On Tue, 2022-01-18 at 01:32 +0000, Al Viro wrote:
> On Mon, Jan 17, 2022 at 07:48:49PM +0000, Al Viro wrote:
> > > But that critically depends upon the contents not getting
> > > mangled.  If it
> > > *can* be screwed by such unlink, we risk successful lookup
> > > leading to the
> > > wrong place, with nothing to tell us that it's happening.  We
> > > could handle
> > > that by adding a check to fs/namei.c:put_link(), and propagating
> > > the error
> > > to callers.  It's not impossible, but it won't be pretty.
> > > 
> > > And that assumes we avoid oopsen on string changing under us in
> > > the first
> > > place.  Which might or might not be true - I hadn't finished the
> > > audit yet.
> > > Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs
> > > methods -
> > > we need to make sure that e.g. everything called by ->d_hash()
> > > instances
> > > is OK with strings changing right under them.  Including
> > > utf8_to_utf32(),
> > > crc32_le(), utf8_casefold_hash(), etc.
> > 
> > And AFAICS, ext4, xfs and possibly ubifs (I'm unfamiliar with that
> > one and
> > the call chains there are deep enough for me to miss something)
> > have the
> > "bugger the contents of string returned by RCU ->get_link() if
> > unlink()
> > happens" problem.
> > 
> > I would very much prefer to have them deal with that crap,
> > especially
> > since I don't see why does ext4_evict_inode() need to do that
> > memset() -
> > can't we simply check ->i_op in ext4_can_truncate() and be done
> > with
> > that?
> 
> This reuse-without-delay has another fun side, AFAICS.  Suppose the
> new use
> for inode comes with the same ->i_op (i.e. it's a symlink again) and
> it
> happens right after ->get_link() has returned the pointer to body.
> 
> We are already past whatever checks we might add in pick_link().  And
> the
> pointer is still valid.  So we end up quietly traversing the body of
> completely unrelated symlink that never had been anywhere near any
> directory
> we might be looking at.  With no indication of anything going wrong -
> just
> a successful resolution with bogus result.

Wouldn't that case be caught by the unlazy call since ->get_link()
needs to return -ECHILD for the rcu case now (in xfs anyway)?

> 
> Could XFS folks explain what exactly goes wrong if we make actual
> marking
> inode as ready for reuse RCU-delayed, by shifting just that into
> ->free_inode()?  Why would we need any extra synchronize_rcu()
> anywhere?

Dave Chinner Jan. 18, 2022, 3 a.m. UTC | #10

On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote:
> On Mon, Jan 17, 2022 at 09:35:58AM -0500, Brian Foster wrote:
> 
> > To Al's question, at the end of the day there is no rcu delay involved
> > with inode reuse in XFS. We do use call_rcu() for eventual freeing of
> > inodes (see __xfs_inode_free()), but inode reuse occurs for inodes that
> > have been put into a "reclaim" state before getting to the point of
> > freeing the struct inode memory. This lead to the long discussion [1]
> > Ian references around ways to potentially deal with that. I think the
> > TLDR of that thread is there are various potential options for
> > improvement, such as to rcu wait on inode creation/reuse (either
> > explicitly or via more open coded grace period cookie tracking), to rcu
> > wait somewhere in the destroy sequence before inodes become reuse
> > candidates, etc., but none of them seemingly agreeable for varying
> > reasons (IIRC mostly stemming from either performance or compexity) [2].
> > 
> > The change that has been made so far in XFS is to turn rcuwalk for
> > symlinks off once again, which looks like landed in Linus' tree as
> > commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata
> > buffers to the vfs"). The hope is that between that patch and this
> > prospective vfs tweak, we can have a couple incremental fixes that at
> > least address the practical problem users have been running into (which
> > is a crash due to a NULL ->get_link() callback pointer due to inode
> > reuse). The inode reuse vs. rcu thing might still be a broader problem,
> > but AFAIA that mechanism has been in place in XFS on Linux pretty much
> > forever.
> 
> My problem with that is that pathname resolution very much relies upon
> the assumption that any inode it observes will *not* change its nature
> until the final rcu_read_unlock().  Papering over ->i_op->get_link reads
> in symlink case might be sufficient at the moment (I'm still not certain
> about that, though), but that's rather brittle.  E.g. if some XFS change
> down the road adds ->permission() on some inodes, you'll get the same
> problem in do_inode_permission().  We also have places where we rely upon
> 	sample ->d_seq
> 	fetch ->d_flags
> 	fetch ->d_inode
> 	validate ->d_seq
> 	...
> 	assume that inode type matches the information in flags
> 
> How painful would it be to make xfs_destroy_inode() a ->free_inode() instance?
> IOW, how far is xfs_inode_mark_reclaimable() from being callable in RCU
> callback context?

AIUI, not very close at all,

I'm pretty sure we can't put it under RCU callback context at all
because xfs_fs_destroy_inode() can take sleeping locks, perform
transactions, do IO, run rcu_read_lock() critical sections, etc.
This means that needs to run an a full task context and so can't run
from RCU callback context at all.

> IOW, ->free_inode() is RCU-delayed part of ->destroy_inode().  If both
> are present, ->destroy_inode() will be called synchronously, followed
> by ->free_inode() from RCU callback, so you can have both - moving just
> the "finally mark for reuse" part into ->free_inode() would be OK.
> Any blocking stuff (if any) can be left in ->destroy_inode()...

Yup, that's pretty much what we already do, except that we run the
RCU-delayed part of freeing the inode once XFS has finished with the
inode internally and the background inode GC reclaims it.

Cheers,

Dave.

Al Viro Jan. 18, 2022, 3:03 a.m. UTC | #11

On Tue, Jan 18, 2022 at 10:31:53AM +0800, Ian Kent wrote:

> Wouldn't that case be caught by the unlazy call since ->get_link()
> needs to return -ECHILD for the rcu case now (in xfs anyway)?

*shrug*

that'd solve the problem, all right, but it's a serious overkill in
several respects.

Al Viro Jan. 18, 2022, 3:17 a.m. UTC | #12

On Tue, Jan 18, 2022 at 02:00:41PM +1100, Dave Chinner wrote:
> > IOW, how far is xfs_inode_mark_reclaimable() from being callable in RCU
> > callback context?
> 
> AIUI, not very close at all,
> 
> I'm pretty sure we can't put it under RCU callback context at all
> because xfs_fs_destroy_inode() can take sleeping locks, perform
> transactions, do IO, run rcu_read_lock() critical sections, etc.
> This means that needs to run an a full task context and so can't run
> from RCU callback context at all.

Umm...  AFAICS, this
	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
	spin_lock(&pag->pag_ici_lock);
	spin_lock(&ip->i_flags_lock);

	trace_xfs_inode_set_reclaimable(ip);
	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
	ip->i_flags |= XFS_IRECLAIMABLE;
	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
	XFS_ICI_RECLAIM_TAG);

	spin_unlock(&ip->i_flags_lock);
	spin_unlock(&pag->pag_ici_lock);
	xfs_perag_put(pag);
in the end of xfs_inodegc_set_reclaimable() could go into ->free_inode()
just fine.  It's xfs_inodegc_queue() I'm not sure about - the part
about flush_work() in there...

I'm not familiar with that code; could you point me towards some
docs/old postings/braindump/whatnot?

Dave Chinner Jan. 18, 2022, 4:12 a.m. UTC | #13

On Tue, Jan 18, 2022 at 03:17:13AM +0000, Al Viro wrote:
> On Tue, Jan 18, 2022 at 02:00:41PM +1100, Dave Chinner wrote:
> > > IOW, how far is xfs_inode_mark_reclaimable() from being callable in RCU
> > > callback context?
> > 
> > AIUI, not very close at all,
> > 
> > I'm pretty sure we can't put it under RCU callback context at all
> > because xfs_fs_destroy_inode() can take sleeping locks, perform
> > transactions, do IO, run rcu_read_lock() critical sections, etc.
> > This means that needs to run an a full task context and so can't run
> > from RCU callback context at all.
> 
> Umm...  AFAICS, this
> 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> 	spin_lock(&pag->pag_ici_lock);
> 	spin_lock(&ip->i_flags_lock);
> 
> 	trace_xfs_inode_set_reclaimable(ip);
> 	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
> 	ip->i_flags |= XFS_IRECLAIMABLE;
> 	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> 	XFS_ICI_RECLAIM_TAG);
> 
> 	spin_unlock(&ip->i_flags_lock);
> 	spin_unlock(&pag->pag_ici_lock);
> 	xfs_perag_put(pag);
>
> in the end of xfs_inodegc_set_reclaimable() could go into ->free_inode()
> just fine.

No, that just creates a black hole where the VFS inode has been
destroyed but the XFS inode cache doesn't know it's been trashed.
Hence setting XFS_IRECLAIMABLE needs to remain in the during
->destroy_inode, otherwise the ->lookup side of the cache will think
that are currently still in use by the VFS and hand them straight
back out without going through the inode recycling code.

i.e. XFS_IRECLAIMABLE is the flag that tells xfs_iget() that the VFS
part of the inode has been torn down, and that it must go back
through VFS re-initialisation before it can be re-instantiated as a
VFS inode.

It would also mean that the inode will need to go through two RCU
grace periods before it gets reclaimed, because XFS uses RCU
protected inode cache lookups internally (e.g. for clustering dirty
inode writeback) and so freeing the inode from the internal
XFS inode cache requires RCU freeing...

> It's xfs_inodegc_queue() I'm not sure about - the part
> about flush_work() in there...

That's the bit that prevents unbound queuing of inodes that require
inactivation because of the problems that causes memory reclaim and
performance. It blocks waiting for xfs_inactive()
calls to complete via xfs_inodegc_worker(), and it's the
xfs_inactive() calls that do all the IO, transactions, locking, etc.

So if we block waiting for them to complete in RCU callback context,
we're effectively inverting the current locking order for entire
filesystem w.r.t. RCU...

Also worth considering: if we are processing the unlink of an inode
with tens or hundreds of millions of extents in xfs_inactive(), that
flush_work() call could block for *minutes* waiting on inactivation
to run the millions of transactions needed to free that inode.
Regardless of lock ordering, we don't want to block RCU grace period
callback work for that long....

> I'm not familiar with that code; could you point me towards some
> docs/old postings/braindump/whatnot?

Commit ab23a7768739 ("xfs: per-cpu deferred inode inactivation
queues") explains the per-cpu queuing mechanism that is being used
in this code.

Cheers,

Dave.

Al Viro Jan. 18, 2022, 5:58 a.m. UTC | #14

On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote:

> No, that just creates a black hole where the VFS inode has been
> destroyed but the XFS inode cache doesn't know it's been trashed.
> Hence setting XFS_IRECLAIMABLE needs to remain in the during
> ->destroy_inode, otherwise the ->lookup side of the cache will think
> that are currently still in use by the VFS and hand them straight
> back out without going through the inode recycling code.
> 
> i.e. XFS_IRECLAIMABLE is the flag that tells xfs_iget() that the VFS
> part of the inode has been torn down, and that it must go back
> through VFS re-initialisation before it can be re-instantiated as a
> VFS inode.

OK...

> It would also mean that the inode will need to go through two RCU
> grace periods before it gets reclaimed, because XFS uses RCU
> protected inode cache lookups internally (e.g. for clustering dirty
> inode writeback) and so freeing the inode from the internal
> XFS inode cache requires RCU freeing...

Wait a minute.  Where is that RCU delay of yours, relative to
xfs_vn_unlink() and xfs_vn_rename() (for target)?  And where does
it happen in case of e.g. open() + unlink() + close()?

Christian Brauner Jan. 18, 2022, 8:29 a.m. UTC | #15

On Mon, Jan 17, 2022 at 06:10:36PM +0000, Al Viro wrote:
> On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote:
> 
> > IOW, ->free_inode() is RCU-delayed part of ->destroy_inode().  If both
> > are present, ->destroy_inode() will be called synchronously, followed
> > by ->free_inode() from RCU callback, so you can have both - moving just
> > the "finally mark for reuse" part into ->free_inode() would be OK.
> > Any blocking stuff (if any) can be left in ->destroy_inode()...
> 
> BTW, we *do* have a problem with ext4 fast symlinks.  Pathwalk assumes that
> strings it parses are not changing under it.  There are rather delicate
> dances in dcache lookups re possibility of ->d_name contents changing under
> it, but the search key is assumed to be stable.
> 
> What's more, there's a correctness issue even if we do not oops.  Currently
> we do not recheck ->d_seq of symlink dentry when we dismiss the symlink from
> the stack.  After all, we'd just finished traversing what used to be the
> contents of a symlink that used to be in the right place.  It might have been
> unlinked while we'd been traversing it, but that's not a correctness issue.
> 
> But that critically depends upon the contents not getting mangled.  If it
> *can* be screwed by such unlink, we risk successful lookup leading to the

Out of curiosity: whether or not it can get mangled depends on the
filesystem and how it implements fast symlinks or do fast symlinks
currently guarantee that contents are mangled?

Brian Foster Jan. 18, 2022, 1:47 p.m. UTC | #16

On Tue, Jan 18, 2022 at 01:32:23AM +0000, Al Viro wrote:
> On Mon, Jan 17, 2022 at 07:48:49PM +0000, Al Viro wrote:
> > > But that critically depends upon the contents not getting mangled.  If it
> > > *can* be screwed by such unlink, we risk successful lookup leading to the
> > > wrong place, with nothing to tell us that it's happening.  We could handle
> > > that by adding a check to fs/namei.c:put_link(), and propagating the error
> > > to callers.  It's not impossible, but it won't be pretty.
> > > 
> > > And that assumes we avoid oopsen on string changing under us in the first
> > > place.  Which might or might not be true - I hadn't finished the audit yet.
> > > Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs methods -
> > > we need to make sure that e.g. everything called by ->d_hash() instances
> > > is OK with strings changing right under them.  Including utf8_to_utf32(),
> > > crc32_le(), utf8_casefold_hash(), etc.
> > 
> > And AFAICS, ext4, xfs and possibly ubifs (I'm unfamiliar with that one and
> > the call chains there are deep enough for me to miss something) have the
> > "bugger the contents of string returned by RCU ->get_link() if unlink()
> > happens" problem.
> > 
> > I would very much prefer to have them deal with that crap, especially
> > since I don't see why does ext4_evict_inode() need to do that memset() -
> > can't we simply check ->i_op in ext4_can_truncate() and be done with
> > that?
> 
> This reuse-without-delay has another fun side, AFAICS.  Suppose the new use
> for inode comes with the same ->i_op (i.e. it's a symlink again) and it
> happens right after ->get_link() has returned the pointer to body.
> 

Yep, I had reproduced this explicitly when playing around with some
instrumented delays and whatnot in the code. This and the similar
variant of just returning internal/non-string data fork metadata via
->get_link() is why I asked to restore old behavior of returning -ECHILD
for inline symlinks.

> We are already past whatever checks we might add in pick_link().  And the
> pointer is still valid.  So we end up quietly traversing the body of
> completely unrelated symlink that never had been anywhere near any directory
> we might be looking at.  With no indication of anything going wrong - just
> a successful resolution with bogus result.
> 
> Could XFS folks explain what exactly goes wrong if we make actual marking
> inode as ready for reuse RCU-delayed, by shifting just that into
> ->free_inode()?  Why would we need any extra synchronize_rcu() anywhere?
> 

Dave already chimed in on why we probably don't want ->free_inode()
across the board. I don't think there's a functional problem with a more
selective injection of an rcu delay on the INACTIVE -> RECLAIMABLE
transition, based on the reasoning specified earlier (i.e., the iget
side already blocks on INACTIVE, so it's just a matter of a longer
delay).

Most of that long thread I previously linked to was us discussing pretty
much how to do something like that with minimal performance impact. The
experiment I ran to measure performance was use of queue_rcu_work() for
inactive inode processing. That resulted in a performance hit to single
threaded sequential file removal, but could be mitigated by increasing
the queue size (which may or may not have other side effects). Dave
suggested a more async approach to track the current grace period in the
inode and refer to it at lookup/alloc time, but that is notably more
involved and isn't clear if/how much it mitigates rcu delays.

IIUC, your thought here is to introduce an rcu delay on the destroy
side, but after the inactive processing rather than before it (as my
previous experiment did). IOW, basically invoke
xfs_inodegc_set_reclaimable() as an rcu callback via
xfs_inodegc_worker(), yes? If so, that seems like a potentially
reasonable option to me since it pulls the delay out of the inactivation
processing pipeline. I suspect the tradeoff with that is it might be
slightly less efficient than doing it earlier because we've lost any
grace period transitions that have occurred since before the inode was
queued and processed, but OTOH this might isolate the impact of that
delay to the inode reuse path. Maybe there's room for a simple
optimization there in cases where a gp may have expired already since
the inode was first queued. Hmm.. maybe I'll give that a try to see
if/how much impact there may be on an inode alloc/free workload..

Brian

Al Viro Jan. 18, 2022, 4:04 p.m. UTC | #17

On Tue, Jan 18, 2022 at 09:29:11AM +0100, Christian Brauner wrote:
> On Mon, Jan 17, 2022 at 06:10:36PM +0000, Al Viro wrote:
> > On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote:
> > 
> > > IOW, ->free_inode() is RCU-delayed part of ->destroy_inode().  If both
> > > are present, ->destroy_inode() will be called synchronously, followed
> > > by ->free_inode() from RCU callback, so you can have both - moving just
> > > the "finally mark for reuse" part into ->free_inode() would be OK.
> > > Any blocking stuff (if any) can be left in ->destroy_inode()...
> > 
> > BTW, we *do* have a problem with ext4 fast symlinks.  Pathwalk assumes that
> > strings it parses are not changing under it.  There are rather delicate
> > dances in dcache lookups re possibility of ->d_name contents changing under
> > it, but the search key is assumed to be stable.
> > 
> > What's more, there's a correctness issue even if we do not oops.  Currently
> > we do not recheck ->d_seq of symlink dentry when we dismiss the symlink from
> > the stack.  After all, we'd just finished traversing what used to be the
> > contents of a symlink that used to be in the right place.  It might have been
> > unlinked while we'd been traversing it, but that's not a correctness issue.
> > 
> > But that critically depends upon the contents not getting mangled.  If it
> > *can* be screwed by such unlink, we risk successful lookup leading to the
> 
> Out of curiosity: whether or not it can get mangled depends on the
> filesystem and how it implements fast symlinks or do fast symlinks
> currently guarantee that contents are mangled?

Not sure if I understand your question correctly...

	Filesystems should guarantee that the contents of string returned
by ->get_link() (or pointed to by ->i_link) remains unchanged for as long
as we are looking at it (until fs/namei.c:put_link() that drops it or
fs/namei.c:drop_links() in the end of pathwalk).  Fast symlinks or not -
doesn't matter.

	The only cases where that does not hold (there are two of them in
the entire kernel) happen to be fast symlinks.	Both cases are bugs.
ext4 case is actually easy to fix - the only reason it ends up mangling
the string is the way ext4_truncate() implements its check for victim
being a fast symlink (and thus needing no work).  It gets disrupted
by zeroing ->i_size, which we need to do in this case (inode removal).
That's not hard to get right.

	A plenty of other fast symlink variants (starting with ext2 ones,
BTW) do not step into anything of that sort.  IIRC, we used to have at
least some cases (orangefs, perhaps?) where revalidate on a symlink
might (with confused or malicious server) end up modifying the contents,
possibly right under somebody else walking that symlink.  Also a bug...

Brian Foster Jan. 18, 2022, 6:25 p.m. UTC | #18

On Tue, Jan 18, 2022 at 08:47:53AM -0500, Brian Foster wrote:
> On Tue, Jan 18, 2022 at 01:32:23AM +0000, Al Viro wrote:
> > On Mon, Jan 17, 2022 at 07:48:49PM +0000, Al Viro wrote:
> > > > But that critically depends upon the contents not getting mangled.  If it
> > > > *can* be screwed by such unlink, we risk successful lookup leading to the
> > > > wrong place, with nothing to tell us that it's happening.  We could handle
> > > > that by adding a check to fs/namei.c:put_link(), and propagating the error
> > > > to callers.  It's not impossible, but it won't be pretty.
> > > > 
> > > > And that assumes we avoid oopsen on string changing under us in the first
> > > > place.  Which might or might not be true - I hadn't finished the audit yet.
> > > > Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs methods -
> > > > we need to make sure that e.g. everything called by ->d_hash() instances
> > > > is OK with strings changing right under them.  Including utf8_to_utf32(),
> > > > crc32_le(), utf8_casefold_hash(), etc.
> > > 
> > > And AFAICS, ext4, xfs and possibly ubifs (I'm unfamiliar with that one and
> > > the call chains there are deep enough for me to miss something) have the
> > > "bugger the contents of string returned by RCU ->get_link() if unlink()
> > > happens" problem.
> > > 
> > > I would very much prefer to have them deal with that crap, especially
> > > since I don't see why does ext4_evict_inode() need to do that memset() -
> > > can't we simply check ->i_op in ext4_can_truncate() and be done with
> > > that?
> > 
> > This reuse-without-delay has another fun side, AFAICS.  Suppose the new use
> > for inode comes with the same ->i_op (i.e. it's a symlink again) and it
> > happens right after ->get_link() has returned the pointer to body.
> > 
> 
> Yep, I had reproduced this explicitly when playing around with some
> instrumented delays and whatnot in the code. This and the similar
> variant of just returning internal/non-string data fork metadata via
> ->get_link() is why I asked to restore old behavior of returning -ECHILD
> for inline symlinks.
> 
> > We are already past whatever checks we might add in pick_link().  And the
> > pointer is still valid.  So we end up quietly traversing the body of
> > completely unrelated symlink that never had been anywhere near any directory
> > we might be looking at.  With no indication of anything going wrong - just
> > a successful resolution with bogus result.
> > 
> > Could XFS folks explain what exactly goes wrong if we make actual marking
> > inode as ready for reuse RCU-delayed, by shifting just that into
> > ->free_inode()?  Why would we need any extra synchronize_rcu() anywhere?
> > 
> 
> Dave already chimed in on why we probably don't want ->free_inode()
> across the board. I don't think there's a functional problem with a more
> selective injection of an rcu delay on the INACTIVE -> RECLAIMABLE
> transition, based on the reasoning specified earlier (i.e., the iget
> side already blocks on INACTIVE, so it's just a matter of a longer
> delay).
> 
> Most of that long thread I previously linked to was us discussing pretty
> much how to do something like that with minimal performance impact. The
> experiment I ran to measure performance was use of queue_rcu_work() for
> inactive inode processing. That resulted in a performance hit to single
> threaded sequential file removal, but could be mitigated by increasing
> the queue size (which may or may not have other side effects). Dave
> suggested a more async approach to track the current grace period in the
> inode and refer to it at lookup/alloc time, but that is notably more
> involved and isn't clear if/how much it mitigates rcu delays.
> 
> IIUC, your thought here is to introduce an rcu delay on the destroy
> side, but after the inactive processing rather than before it (as my
> previous experiment did). IOW, basically invoke
> xfs_inodegc_set_reclaimable() as an rcu callback via
> xfs_inodegc_worker(), yes? If so, that seems like a potentially
> reasonable option to me since it pulls the delay out of the inactivation
> processing pipeline. I suspect the tradeoff with that is it might be
> slightly less efficient than doing it earlier because we've lost any
> grace period transitions that have occurred since before the inode was
> queued and processed, but OTOH this might isolate the impact of that
> delay to the inode reuse path. Maybe there's room for a simple
> optimization there in cases where a gp may have expired already since
> the inode was first queued. Hmm.. maybe I'll give that a try to see
> if/how much impact there may be on an inode alloc/free workload..
> 

Here are results from a few quick tests running a tight inode alloc/free
loop against an XFS filesystem with some concurrent kernel source tree
recursive ls activity running on a separate (XFS) filesystem in the
background. The loop runs for 60s and reports how many create/unlink
cycles it was able to perform in that time:

Baseline (xfs for-next, v5.16.0-rc5):	~34k cycles
inactive -> reclaimable grace period:	~29k cycles
unconditional iget rcu sync:		~4400 cycles

Note that I did get a lockdep splat from _set_reclaimable() in rcu
context that I've ignored for now because the test completed. That
aside, that looks like about a ~15% or so hit from baseline. The last
test inserts an unconditional synchronize_rcu() in the iget recycle path
to provide a reference for the likely worst case implementation.

If I go back to the inactive -> reclaimable grace period variant and
also insert a start_poll_synchronize_rcu() and
poll_state_synchronize_rcu() pair across the inactive processing
sequence, I start seeing numbers closer to ~36k cycles. IOW, the
xfs_inodegc_inactivate() helper looks something like this:

        if (poll_state_synchronize_rcu(ip->i_destroy_gp))
                xfs_inodegc_set_reclaimable(ip);
        else
                call_rcu(&VFS_I(ip)->i_rcu, xfs_inodegc_set_reclaimable_callback);

... to skip the rcu grace period if one had already passed while the
inode sat on the inactivation queue and was processed.

However my box went haywire shortly after with rcu stall reports or
something if I let that continue to run longer, so I'll probably have to
look into that lockdep splat (complaining about &pag->pag_ici_lock in
rcu context, perhaps needs to become irq safe?) or see if something else
is busted..

Brian

> Brian

Al Viro Jan. 18, 2022, 7:20 p.m. UTC | #19

On Tue, Jan 18, 2022 at 01:25:15PM -0500, Brian Foster wrote:

> If I go back to the inactive -> reclaimable grace period variant and
> also insert a start_poll_synchronize_rcu() and
> poll_state_synchronize_rcu() pair across the inactive processing
> sequence, I start seeing numbers closer to ~36k cycles. IOW, the
> xfs_inodegc_inactivate() helper looks something like this:
> 
>         if (poll_state_synchronize_rcu(ip->i_destroy_gp))
>                 xfs_inodegc_set_reclaimable(ip);
>         else
>                 call_rcu(&VFS_I(ip)->i_rcu, xfs_inodegc_set_reclaimable_callback);
> 
> ... to skip the rcu grace period if one had already passed while the
> inode sat on the inactivation queue and was processed.
> 
> However my box went haywire shortly after with rcu stall reports or
> something if I let that continue to run longer, so I'll probably have to
> look into that lockdep splat (complaining about &pag->pag_ici_lock in
> rcu context, perhaps needs to become irq safe?) or see if something else
> is busted..

Umm...  Could you (or Dave) describe where does the mainline do the
RCU delay mentioned several times in these threads, in case of
	* unlink()
	* overwriting rename()
	* open() + unlink() + close() (that one, obviously, for regular files)

The thing is, if we already do have an RCU delay in there, there might be
a better solution - making sure it happens downstream of d_drop() (in case
when dentry has other references) or dentry going negative (in case we are
holding the sole reference to it).

If we can do that, RCU'd dcache lookups won't run into inode reuse at all.
Sure, right now d_delete() comes last *and* we want the target inode to stay
around past the call of ->unlink().  But if you look at do_unlinkat() you'll
see an interesting-looking wart with ihold/iput around vfs_unlink().  Not
sure if you remember the story on that one; basically, it's "we'd rather
have possible IO on inode freeing to happen outside of the lock on parent".

nfsd and mqueue do the same thing; ksmbd does not.  OTOH, ksmbd appears to
force the "make victim go unhashed, sole reference or not".  ecryptfs
definitely does that forcing (deliberately so).

That area could use some rethinking, and if we can deal with the inode reuse
delay while we are at it...

Overwriting rename() is also interesting in that respect, of course.

I can go and try to RTFS my way through xfs iget-related code, but I'd
obviously prefer to do so with at least some overview of that thing
from the folks familiar with it.  Seeing that it's a lockless search
structure, missing something subtle there is all too easy, especially
with the lookup-by-fhandle stuff in the mix...

Brian Foster Jan. 18, 2022, 8:58 p.m. UTC | #20

On Tue, Jan 18, 2022 at 07:20:35PM +0000, Al Viro wrote:
> On Tue, Jan 18, 2022 at 01:25:15PM -0500, Brian Foster wrote:
> 
> > If I go back to the inactive -> reclaimable grace period variant and
> > also insert a start_poll_synchronize_rcu() and
> > poll_state_synchronize_rcu() pair across the inactive processing
> > sequence, I start seeing numbers closer to ~36k cycles. IOW, the
> > xfs_inodegc_inactivate() helper looks something like this:
> > 
> >         if (poll_state_synchronize_rcu(ip->i_destroy_gp))
> >                 xfs_inodegc_set_reclaimable(ip);
> >         else
> >                 call_rcu(&VFS_I(ip)->i_rcu, xfs_inodegc_set_reclaimable_callback);
> > 
> > ... to skip the rcu grace period if one had already passed while the
> > inode sat on the inactivation queue and was processed.
> > 
> > However my box went haywire shortly after with rcu stall reports or
> > something if I let that continue to run longer, so I'll probably have to
> > look into that lockdep splat (complaining about &pag->pag_ici_lock in
> > rcu context, perhaps needs to become irq safe?) or see if something else
> > is busted..
> 
> Umm...  Could you (or Dave) describe where does the mainline do the
> RCU delay mentioned several times in these threads, in case of
> 	* unlink()
> 	* overwriting rename()
> 	* open() + unlink() + close() (that one, obviously, for regular files)
> 

If you're referring to the existing rcu delay in XFS, I suspect that
refers to xfs_reclaim_inode(). The last bits of that function remove the
inode from the perag tree and then calls __xfs_inode_free(), which frees
the inode via rcu callback.

BTW, I think the experiment above is either going to require an audit to
make the various _set_reclaimable() locks irq safe or something a bit
more ugly like putting the inode back on a workqueue after the rcu delay
to make the state transition. Given the incremental improvement from
using start_poll_synchronize_rcu(), I replaced the above destroy side
code with a cond_synchronize_rcu(ip->i_destroy_gp) call on the
iget/recycle side and see similar results (~36k cycles per 60s, but so
far without any explosions).

Brian

> The thing is, if we already do have an RCU delay in there, there might be
> a better solution - making sure it happens downstream of d_drop() (in case
> when dentry has other references) or dentry going negative (in case we are
> holding the sole reference to it).
> 
> If we can do that, RCU'd dcache lookups won't run into inode reuse at all.
> Sure, right now d_delete() comes last *and* we want the target inode to stay
> around past the call of ->unlink().  But if you look at do_unlinkat() you'll
> see an interesting-looking wart with ihold/iput around vfs_unlink().  Not
> sure if you remember the story on that one; basically, it's "we'd rather
> have possible IO on inode freeing to happen outside of the lock on parent".
> 
> nfsd and mqueue do the same thing; ksmbd does not.  OTOH, ksmbd appears to
> force the "make victim go unhashed, sole reference or not".  ecryptfs
> definitely does that forcing (deliberately so).
> 
> That area could use some rethinking, and if we can deal with the inode reuse
> delay while we are at it...
> 
> Overwriting rename() is also interesting in that respect, of course.
> 
> I can go and try to RTFS my way through xfs iget-related code, but I'd
> obviously prefer to do so with at least some overview of that thing
> from the folks familiar with it.  Seeing that it's a lockless search
> structure, missing something subtle there is all too easy, especially
> with the lookup-by-fhandle stuff in the mix...
>

Dave Chinner Jan. 18, 2022, 11:25 p.m. UTC | #21

On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote:
> On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote:
> 
> > No, that just creates a black hole where the VFS inode has been
> > destroyed but the XFS inode cache doesn't know it's been trashed.
> > Hence setting XFS_IRECLAIMABLE needs to remain in the during
> > ->destroy_inode, otherwise the ->lookup side of the cache will think
> > that are currently still in use by the VFS and hand them straight
> > back out without going through the inode recycling code.
> > 
> > i.e. XFS_IRECLAIMABLE is the flag that tells xfs_iget() that the VFS
> > part of the inode has been torn down, and that it must go back
> > through VFS re-initialisation before it can be re-instantiated as a
> > VFS inode.
> 
> OK...
> 
> > It would also mean that the inode will need to go through two RCU
> > grace periods before it gets reclaimed, because XFS uses RCU
> > protected inode cache lookups internally (e.g. for clustering dirty
> > inode writeback) and so freeing the inode from the internal
> > XFS inode cache requires RCU freeing...
> 
> Wait a minute.  Where is that RCU delay of yours, relative to
> xfs_vn_unlink() and xfs_vn_rename() (for target)?

Both of those drop the inode on an on-disk unlinked list. When the
last reference goes away, ->destroy_inode then runs inactivation.

Inactivation then runs transactions to free all the space attached
to the inode and then removes the inode from the unlinked list and
frees it. It then goes into the XFS_IRECLAIMABLE state and is dirty
in memory. It can't be reclaimed until the inode is written to disk
or the whole inode cluster is freed and the inode marked XFS_ISTALE
(so won't get written back).

At that point, a background inode reclaim thread (runs every 5s)
does a RCU protected lockless radix tree walk to find
XFS_IRECLAIMABLE inodes (via radix tree tags). If they are clean, it
moves them to XFS_IRECLAIM state, deletes them from the radix tree
and frees them via a call_rcu() callback.

If memory reclaim comes along sooner than this, the
->free_cached_objects() superblock shrinker callback runs that RCU
protected lockless radix tree walk to find XFS_IRECLAIMABLE inodes.

> And where does
> it happen in case of e.g. open() + unlink() + close()?

Same thing - close() drops the last reference, the unlinked inode
goes through inactivation, then moves into the XFS_IRECLAIMABLE
state.

The problem is not -quite- open-unlink-close. The problem case is
the reallocation of an on-disk inode in the case of
unlink-close-open(O_CREATE) operations because of the on-disk inode
allocator policy of aggressive reuse of recently freed inodes.  In
that case the xfs_iget() lookup will reinstantiate the inode via
xfs_iget_recycle() and the inode will change identity between VFS
instantiations.

This is where a RCU grace period is absolutely required, and we
don't currently have one. The bug was introduced with RCU freeing of
inodes (what, 15 years ago now?) and it's only recently that we've
realised this bug exists via code inspection. We really have no
evidence that it's actually been tripped over in the wild....

Unfortunately, the simple fix of adding syncronize_rcu() to
xfs_iget_recycle() causes significant performance regressions
because we hit this path quite frequently when workloads use lots of
temporary files - the on-disk inode allocator policy tends towards
aggressive re-use of inodes for small sets of temporary files.

The problem XFS is trying to address is that the VFS inode lifecycle
does not cater for filesystems that need to both dirty and then
clean unlinked inodes between iput_final() and ->destroy_inode. It's
too late to be able to put the inode back on the LRU once we've
decided to drop the inode if we need to dirty it again. ANd because
evict() is part of the non-blocking memory reclaim, we aren't
supposed to block for arbitrarily long periods of time or create
unbound memory demand processing inode eviction (both of which XFS
can do in inactivation).

IOWs, XFS can't free the inode until it's journal releases the
internal reference on the dirty inode. ext4 doesn't track inodes in
it's journal - it only tracks inode buffers that contain the changes
made to the inode, so once the transaction is committed in
ext4_evict_inode() the inode can be immediately freed via either
->destroy_inode or ->free_inode. That option does not exist for XFS
because we have to wait for the journal to finish with the inode
before it can be freed. Hence all the background reclaim stuff.

We've recently solved several of the problems we need to solve to
reduce the mismatch; avoiding blocking on inode writeback in reclaim
and background inactivation are two of the major pieces of work we
needed done before we could even consider more closely aligning XFS
to the VFS inode cache life cycle model.

The next step is to move the background inode inactivation triggers
up into ->drop_inode so we can catch inodes that need to be dirtied
by the filesysetm before they have been marked for eviction by the
VFS. This will allow us to keep the inode on the VFS LRU (probably
marked with I_WILL_FREE so everyone else keeps away from it) whilst
we are waiting for the background inactivation work to be done, the
journal flushed and the metadata written back. Once clean, we can
directly evict the inode from the VFS ourselves.

This would mean we only get clean, reclaimable inodes hitting the
evict() path, and so at that point we can just remove the inode
directly from the XFS inode cache from either ->destroy_inode or
->free_inode and RCU free it. The recycling of in-memory inodes in
xfs_iget_cache_hit can go away entirely because no inodes will
linger in the XFS inode cache without being visible at the VFS
layer as they do now...

That's going to take a fair bit of work to realise, and I'm not sure
yet exactly what mods are going to be needed to either the VFS inode
infrastructure or the XFS inode cache. 

Cheers,

Dave.

Christian Brauner Jan. 19, 2022, 9:05 a.m. UTC | #22

On Tue, Jan 18, 2022 at 04:04:50PM +0000, Al Viro wrote:
> On Tue, Jan 18, 2022 at 09:29:11AM +0100, Christian Brauner wrote:
> > On Mon, Jan 17, 2022 at 06:10:36PM +0000, Al Viro wrote:
> > > On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote:
> > > 
> > > > IOW, ->free_inode() is RCU-delayed part of ->destroy_inode().  If both
> > > > are present, ->destroy_inode() will be called synchronously, followed
> > > > by ->free_inode() from RCU callback, so you can have both - moving just
> > > > the "finally mark for reuse" part into ->free_inode() would be OK.
> > > > Any blocking stuff (if any) can be left in ->destroy_inode()...
> > > 
> > > BTW, we *do* have a problem with ext4 fast symlinks.  Pathwalk assumes that
> > > strings it parses are not changing under it.  There are rather delicate
> > > dances in dcache lookups re possibility of ->d_name contents changing under
> > > it, but the search key is assumed to be stable.
> > > 
> > > What's more, there's a correctness issue even if we do not oops.  Currently
> > > we do not recheck ->d_seq of symlink dentry when we dismiss the symlink from
> > > the stack.  After all, we'd just finished traversing what used to be the
> > > contents of a symlink that used to be in the right place.  It might have been
> > > unlinked while we'd been traversing it, but that's not a correctness issue.
> > > 
> > > But that critically depends upon the contents not getting mangled.  If it
> > > *can* be screwed by such unlink, we risk successful lookup leading to the
> > 
> > Out of curiosity: whether or not it can get mangled depends on the
> > filesystem and how it implements fast symlinks or do fast symlinks
> > currently guarantee that contents are mangled?
> 
> Not sure if I understand your question correctly...
> 
> 	Filesystems should guarantee that the contents of string returned
> by ->get_link() (or pointed to by ->i_link) remains unchanged for as long
> as we are looking at it (until fs/namei.c:put_link() that drops it or
> fs/namei.c:drop_links() in the end of pathwalk).  Fast symlinks or not -
> doesn't matter.

Yep, got that.

> 
> 	The only cases where that does not hold (there are two of them in
> the entire kernel) happen to be fast symlinks.	Both cases are bugs.

Ok, that's what I was essentially after whether or not they were bugs in
the filesystems or it's a generic bug.

> ext4 case is actually easy to fix - the only reason it ends up mangling
> the string is the way ext4_truncate() implements its check for victim
> being a fast symlink (and thus needing no work).  It gets disrupted
> by zeroing ->i_size, which we need to do in this case (inode removal).
> That's not hard to get right.

Oh, I see, it zeroes i_size and erases i_data which obviously tramples
the fast symlink contents.

Given that ext4 makes use of i_flags for their ext4 inode containers why
couldn't this just be sm like

#define EXT4_FAST_SYMLINK	        0x<some-free-value>

	if (EXT4_I(inode)->i_flags & EXT4_FAST_SYMLINK)
		return <im-a-fast-symlink>;

? Which seems simpler and more obvious to someone reading that code than
logic based on substracting blocks or checking i_size.

Brian Foster Jan. 19, 2022, 2:08 p.m. UTC | #23

On Wed, Jan 19, 2022 at 10:25:47AM +1100, Dave Chinner wrote:
> On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote:
> > On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote:
> > 
> > > No, that just creates a black hole where the VFS inode has been
> > > destroyed but the XFS inode cache doesn't know it's been trashed.
> > > Hence setting XFS_IRECLAIMABLE needs to remain in the during
> > > ->destroy_inode, otherwise the ->lookup side of the cache will think
> > > that are currently still in use by the VFS and hand them straight
> > > back out without going through the inode recycling code.
> > > 
> > > i.e. XFS_IRECLAIMABLE is the flag that tells xfs_iget() that the VFS
> > > part of the inode has been torn down, and that it must go back
> > > through VFS re-initialisation before it can be re-instantiated as a
> > > VFS inode.
> > 
> > OK...
> > 
> > > It would also mean that the inode will need to go through two RCU
> > > grace periods before it gets reclaimed, because XFS uses RCU
> > > protected inode cache lookups internally (e.g. for clustering dirty
> > > inode writeback) and so freeing the inode from the internal
> > > XFS inode cache requires RCU freeing...
> > 
> > Wait a minute.  Where is that RCU delay of yours, relative to
> > xfs_vn_unlink() and xfs_vn_rename() (for target)?
> 
> Both of those drop the inode on an on-disk unlinked list. When the
> last reference goes away, ->destroy_inode then runs inactivation.
> 
> Inactivation then runs transactions to free all the space attached
> to the inode and then removes the inode from the unlinked list and
> frees it. It then goes into the XFS_IRECLAIMABLE state and is dirty
> in memory. It can't be reclaimed until the inode is written to disk
> or the whole inode cluster is freed and the inode marked XFS_ISTALE
> (so won't get written back).
> 
> At that point, a background inode reclaim thread (runs every 5s)
> does a RCU protected lockless radix tree walk to find
> XFS_IRECLAIMABLE inodes (via radix tree tags). If they are clean, it
> moves them to XFS_IRECLAIM state, deletes them from the radix tree
> and frees them via a call_rcu() callback.
> 
> If memory reclaim comes along sooner than this, the
> ->free_cached_objects() superblock shrinker callback runs that RCU
> protected lockless radix tree walk to find XFS_IRECLAIMABLE inodes.
> 
> > And where does
> > it happen in case of e.g. open() + unlink() + close()?
> 
> Same thing - close() drops the last reference, the unlinked inode
> goes through inactivation, then moves into the XFS_IRECLAIMABLE
> state.
> 
> The problem is not -quite- open-unlink-close. The problem case is
> the reallocation of an on-disk inode in the case of
> unlink-close-open(O_CREATE) operations because of the on-disk inode
> allocator policy of aggressive reuse of recently freed inodes.  In
> that case the xfs_iget() lookup will reinstantiate the inode via
> xfs_iget_recycle() and the inode will change identity between VFS
> instantiations.
> 
> This is where a RCU grace period is absolutely required, and we
> don't currently have one. The bug was introduced with RCU freeing of
> inodes (what, 15 years ago now?) and it's only recently that we've
> realised this bug exists via code inspection. We really have no
> evidence that it's actually been tripped over in the wild....
> 

To be fair, we have multiple reports of the NULL ->get_link() variant
and my tests to this point to induce "unexpected" returns of that
function don't manifest in as catastrophic a side effect, so might not
be as immediately noticeable by users. I.e., returning non-string data
doesn't seem to necessarily cause a crash in the vfs and the symlink to
symlink variant is more of an unexpected redirection of a lookup.

IOW, I think it's fairly logical to assume that if users are hitting the
originally reported problem, they're likely dangerously close to the
subsequent problems that have been identified from further inspection of
the related code. I don't think this is purely a case of a "theoretical"
problem that doesn't warrant some form of prioritized fix.

> Unfortunately, the simple fix of adding syncronize_rcu() to
> xfs_iget_recycle() causes significant performance regressions
> because we hit this path quite frequently when workloads use lots of
> temporary files - the on-disk inode allocator policy tends towards
> aggressive re-use of inodes for small sets of temporary files.
> 
> The problem XFS is trying to address is that the VFS inode lifecycle
> does not cater for filesystems that need to both dirty and then
> clean unlinked inodes between iput_final() and ->destroy_inode. It's
> too late to be able to put the inode back on the LRU once we've
> decided to drop the inode if we need to dirty it again. ANd because
> evict() is part of the non-blocking memory reclaim, we aren't
> supposed to block for arbitrarily long periods of time or create
> unbound memory demand processing inode eviction (both of which XFS
> can do in inactivation).
> 
> IOWs, XFS can't free the inode until it's journal releases the
> internal reference on the dirty inode. ext4 doesn't track inodes in
> it's journal - it only tracks inode buffers that contain the changes
> made to the inode, so once the transaction is committed in
> ext4_evict_inode() the inode can be immediately freed via either
> ->destroy_inode or ->free_inode. That option does not exist for XFS
> because we have to wait for the journal to finish with the inode
> before it can be freed. Hence all the background reclaim stuff.
> 
> We've recently solved several of the problems we need to solve to
> reduce the mismatch; avoiding blocking on inode writeback in reclaim
> and background inactivation are two of the major pieces of work we
> needed done before we could even consider more closely aligning XFS
> to the VFS inode cache life cycle model.
> 

The background inactivation work facilitates an incremental improvement
by nature because destroyed inodes go directly to a queue instead of
being processed synchronously. My most recent test to stamp the grace
period info at inode destroy time and conditionally sync at reuse time
shows pretty much no major cost because the common case is that a grace
period has already expired by the time the queue populates, is processed
and said inodes become reclaimable and reallocated. To go beyond just
the performance result, if I open code the conditional sync for tracking
purposes I only see something like 10-15 rcu waits out of the 36k
allocation cycles. If I increase the background workload 4x, the
allocation rate drops to ~33k cycles (which is still pretty much in line
with baseline) and the rcu sync count increases to 70, which again is
relatively nominal over tens of thousands of cycles.

This all requires some more thorough testing, but I'm sure it won't be
absolutely free for every possible workload or environment. But given
that we know this infrastructure is fundamentally broken (by subtle
compatibilities between XFS and the VFS that have evolved over time),
will require some thought and time to fix properly in the filesystem,
that users are running into problems very closely related to it, why not
try to address the fundamental breakage if we can do so with an isolated
change with minimal (but probably not zero) performance impact?

I agree that the unconditional synchronize_rcu() on reuse approach is
just not viable, but so far tests using cond_synchronize_rcu() seem
fairly reasonable. Is there some other problem or concern with such an
approach?

Brian

> The next step is to move the background inode inactivation triggers
> up into ->drop_inode so we can catch inodes that need to be dirtied
> by the filesysetm before they have been marked for eviction by the
> VFS. This will allow us to keep the inode on the VFS LRU (probably
> marked with I_WILL_FREE so everyone else keeps away from it) whilst
> we are waiting for the background inactivation work to be done, the
> journal flushed and the metadata written back. Once clean, we can
> directly evict the inode from the VFS ourselves.
> 
> This would mean we only get clean, reclaimable inodes hitting the
> evict() path, and so at that point we can just remove the inode
> directly from the XFS inode cache from either ->destroy_inode or
> ->free_inode and RCU free it. The recycling of in-memory inodes in
> xfs_iget_cache_hit can go away entirely because no inodes will
> linger in the XFS inode cache without being visible at the VFS
> layer as they do now...
> 
> That's going to take a fair bit of work to realise, and I'm not sure
> yet exactly what mods are going to be needed to either the VFS inode
> infrastructure or the XFS inode cache. 
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Dave Chinner Jan. 19, 2022, 10:07 p.m. UTC | #24

On Wed, Jan 19, 2022 at 09:08:22AM -0500, Brian Foster wrote:
> On Wed, Jan 19, 2022 at 10:25:47AM +1100, Dave Chinner wrote:
> > On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote:
> > > On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote:
> > Unfortunately, the simple fix of adding syncronize_rcu() to
> > xfs_iget_recycle() causes significant performance regressions
> > because we hit this path quite frequently when workloads use lots of
> > temporary files - the on-disk inode allocator policy tends towards
> > aggressive re-use of inodes for small sets of temporary files.
> > 
> > The problem XFS is trying to address is that the VFS inode lifecycle
> > does not cater for filesystems that need to both dirty and then
> > clean unlinked inodes between iput_final() and ->destroy_inode. It's
> > too late to be able to put the inode back on the LRU once we've
> > decided to drop the inode if we need to dirty it again. ANd because
> > evict() is part of the non-blocking memory reclaim, we aren't
> > supposed to block for arbitrarily long periods of time or create
> > unbound memory demand processing inode eviction (both of which XFS
> > can do in inactivation).
> > 
> > IOWs, XFS can't free the inode until it's journal releases the
> > internal reference on the dirty inode. ext4 doesn't track inodes in
> > it's journal - it only tracks inode buffers that contain the changes
> > made to the inode, so once the transaction is committed in
> > ext4_evict_inode() the inode can be immediately freed via either
> > ->destroy_inode or ->free_inode. That option does not exist for XFS
> > because we have to wait for the journal to finish with the inode
> > before it can be freed. Hence all the background reclaim stuff.
> > 
> > We've recently solved several of the problems we need to solve to
> > reduce the mismatch; avoiding blocking on inode writeback in reclaim
> > and background inactivation are two of the major pieces of work we
> > needed done before we could even consider more closely aligning XFS
> > to the VFS inode cache life cycle model.
> > 
> 
> The background inactivation work facilitates an incremental improvement
> by nature because destroyed inodes go directly to a queue instead of
> being processed synchronously. My most recent test to stamp the grace
> period info at inode destroy time and conditionally sync at reuse time
> shows pretty much no major cost because the common case is that a grace
> period has already expired by the time the queue populates, is processed
> and said inodes become reclaimable and reallocated.

Yup. Remember that I suggested these conditional variants in the
first place - I do understand what this code does...

> To go beyond just
> the performance result, if I open code the conditional sync for tracking
> purposes I only see something like 10-15 rcu waits out of the 36k
> allocation cycles. If I increase the background workload 4x, the
> allocation rate drops to ~33k cycles (which is still pretty much in line
> with baseline) and the rcu sync count increases to 70, which again is
> relatively nominal over tens of thousands of cycles.

Yup. But that doesn't mean that the calls that trigger are free from
impact. The cost and latency of waiting for an RCU grace period to
expire goes up as the CPU count goes up. e.g. it requires every CPU
running a task goes through a context switch before it returns.
Hence if we end up with situations like, say, the ioend completion
scheduling holdoffs, then that will prevent the RCU sync from
returning for seconds.

IOWs, we're effectively adding unpredictable and non-deterministic
latency into the recycle path that is visible to userspace
applications, and those latencies can be caused by subsystem
functionality not related to XFS. Hence we need to carefully
consider unexpected side-effects of adding a kernel global
synchronisation point into a XFS icache lookup fast path, and these
issues may not be immediately obvious from testing...

> This all requires some more thorough testing, but I'm sure it won't be
> absolutely free for every possible workload or environment. But given
> that we know this infrastructure is fundamentally broken (by subtle
> compatibilities between XFS and the VFS that have evolved over time),
> will require some thought and time to fix properly in the filesystem,
> that users are running into problems very closely related to it, why not
> try to address the fundamental breakage if we can do so with an isolated
> change with minimal (but probably not zero) performance impact?
> 
> I agree that the unconditional synchronize_rcu() on reuse approach is
> just not viable, but so far tests using cond_synchronize_rcu() seem
> fairly reasonable. Is there some other problem or concern with such an
> approach?

Just that the impact of adding RCU sync points means that bad
behaviour outside XFS have a new point where they can adversely
impact on applications doing filesystem operations.

As a temporary mitigation strategy I think it will probably be fine,
but I'd much prefer we get rid of the need for such an RCU sync
point rather than try to maintain a mitigation like this in fast
path code forever.

Cheers,

Dave.

Brian Foster Jan. 20, 2022, 4:03 p.m. UTC | #25

On Thu, Jan 20, 2022 at 09:07:47AM +1100, Dave Chinner wrote:
> On Wed, Jan 19, 2022 at 09:08:22AM -0500, Brian Foster wrote:
> > On Wed, Jan 19, 2022 at 10:25:47AM +1100, Dave Chinner wrote:
> > > On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote:
> > > > On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote:
> > > Unfortunately, the simple fix of adding syncronize_rcu() to
> > > xfs_iget_recycle() causes significant performance regressions
> > > because we hit this path quite frequently when workloads use lots of
> > > temporary files - the on-disk inode allocator policy tends towards
> > > aggressive re-use of inodes for small sets of temporary files.
> > > 
> > > The problem XFS is trying to address is that the VFS inode lifecycle
> > > does not cater for filesystems that need to both dirty and then
> > > clean unlinked inodes between iput_final() and ->destroy_inode. It's
> > > too late to be able to put the inode back on the LRU once we've
> > > decided to drop the inode if we need to dirty it again. ANd because
> > > evict() is part of the non-blocking memory reclaim, we aren't
> > > supposed to block for arbitrarily long periods of time or create
> > > unbound memory demand processing inode eviction (both of which XFS
> > > can do in inactivation).
> > > 
> > > IOWs, XFS can't free the inode until it's journal releases the
> > > internal reference on the dirty inode. ext4 doesn't track inodes in
> > > it's journal - it only tracks inode buffers that contain the changes
> > > made to the inode, so once the transaction is committed in
> > > ext4_evict_inode() the inode can be immediately freed via either
> > > ->destroy_inode or ->free_inode. That option does not exist for XFS
> > > because we have to wait for the journal to finish with the inode
> > > before it can be freed. Hence all the background reclaim stuff.
> > > 
> > > We've recently solved several of the problems we need to solve to
> > > reduce the mismatch; avoiding blocking on inode writeback in reclaim
> > > and background inactivation are two of the major pieces of work we
> > > needed done before we could even consider more closely aligning XFS
> > > to the VFS inode cache life cycle model.
> > > 
> > 
> > The background inactivation work facilitates an incremental improvement
> > by nature because destroyed inodes go directly to a queue instead of
> > being processed synchronously. My most recent test to stamp the grace
> > period info at inode destroy time and conditionally sync at reuse time
> > shows pretty much no major cost because the common case is that a grace
> > period has already expired by the time the queue populates, is processed
> > and said inodes become reclaimable and reallocated.
> 
> Yup. Remember that I suggested these conditional variants in the
> first place - I do understand what this code does...
> 
> > To go beyond just
> > the performance result, if I open code the conditional sync for tracking
> > purposes I only see something like 10-15 rcu waits out of the 36k
> > allocation cycles. If I increase the background workload 4x, the
> > allocation rate drops to ~33k cycles (which is still pretty much in line
> > with baseline) and the rcu sync count increases to 70, which again is
> > relatively nominal over tens of thousands of cycles.
> 
> Yup. But that doesn't mean that the calls that trigger are free from
> impact. The cost and latency of waiting for an RCU grace period to
> expire goes up as the CPU count goes up. e.g. it requires every CPU
> running a task goes through a context switch before it returns.
> Hence if we end up with situations like, say, the ioend completion
> scheduling holdoffs, then that will prevent the RCU sync from
> returning for seconds.
> 

Sure... this is part of the reason the tests I've run so far have all
tried to incorporate background rcuwalk activity, run on a higher cpu
count box, etc. And from the XFS side of the coin, the iget code can
invoke xfs_inodegc_queue_all() in the needs_inactive case before
reclaimable state is a possibility, which queues a work on every cpu
with pending inactive inodes. That is probably unlikely in the
free/alloc case (since needs_inactive inodes are not yet free on disk),
but the broader points are that the inactive processing work has to
complete one way or another before reclaimable state is possible and
that we can probably accommodate a synchronization point here if it's
reasonably filtered. Otherwise...

> IOWs, we're effectively adding unpredictable and non-deterministic
> latency into the recycle path that is visible to userspace
> applications, and those latencies can be caused by subsystem
> functionality not related to XFS. Hence we need to carefully
> consider unexpected side-effects of adding a kernel global
> synchronisation point into a XFS icache lookup fast path, and these
> issues may not be immediately obvious from testing...
> 

... agreed. I think at this point we've also discussed various potential
ways to shift or minimize latency/cost further, so there's probably
still room for refinement if such unexpected things crop up before...

> > This all requires some more thorough testing, but I'm sure it won't be
> > absolutely free for every possible workload or environment. But given
> > that we know this infrastructure is fundamentally broken (by subtle
> > compatibilities between XFS and the VFS that have evolved over time),
> > will require some thought and time to fix properly in the filesystem,
> > that users are running into problems very closely related to it, why not
> > try to address the fundamental breakage if we can do so with an isolated
> > change with minimal (but probably not zero) performance impact?
> > 
> > I agree that the unconditional synchronize_rcu() on reuse approach is
> > just not viable, but so far tests using cond_synchronize_rcu() seem
> > fairly reasonable. Is there some other problem or concern with such an
> > approach?
> 
> Just that the impact of adding RCU sync points means that bad
> behaviour outside XFS have a new point where they can adversely
> impact on applications doing filesystem operations.
> 
> As a temporary mitigation strategy I think it will probably be fine,
> but I'd much prefer we get rid of the need for such an RCU sync
> point rather than try to maintain a mitigation like this in fast
> path code forever.
> 

... we end up here. All in all, this is intended to be a
practical/temporary step toward functional correctness that minimizes
performance impact and disruption (i.e. just as easy to remove when made
unnecessary).

Al,

The caveat to this is I think the practicality of a conditional sync in
the iget recycle code sort of depends on the queueing/batching nature of
inactive inode processing in XFS. If you look at xfs_fs_destroy_inode()
for example, you'll see this is all fairly recent feature/infrastructure
code and that historically we completed most of this transition to
reclaimable state before ->destroy_inode() returns. Therefore, the
concern I have is that on older/stable kernels (where users are hitting
this NULL ->get_link() BUG) the reuse code is far more likely to stall
and slow down here with this change alone (see my earlier numbers on the
unconditional sync_rcu() test for prospective worst case). For that
reason, I'm not sure this is really a backportable solution.

So getting back to your concern around Ian's patch being a
stopgap/bandaid type solution, would you be willing to pull something
like Ian's patch to the vfs if upstream XFS adopts this conditional rcu
sync in the iget reuse code? I think that would ensure that no further
bandaid fixes will be required in the vfs due to XFS inode reuse, but
would also provide an isolated variant of the fix in the VFS that is
more easily backported to stable kernels. Thoughts?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Brian Foster Jan. 20, 2022, 4:34 p.m. UTC | #26

On Thu, Jan 20, 2022 at 11:03:28AM -0500, Brian Foster wrote:
> On Thu, Jan 20, 2022 at 09:07:47AM +1100, Dave Chinner wrote:
> > On Wed, Jan 19, 2022 at 09:08:22AM -0500, Brian Foster wrote:
> > > On Wed, Jan 19, 2022 at 10:25:47AM +1100, Dave Chinner wrote:
> > > > On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote:
> > > > > On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote:
> > > > Unfortunately, the simple fix of adding syncronize_rcu() to
> > > > xfs_iget_recycle() causes significant performance regressions
> > > > because we hit this path quite frequently when workloads use lots of
> > > > temporary files - the on-disk inode allocator policy tends towards
> > > > aggressive re-use of inodes for small sets of temporary files.
> > > > 
> > > > The problem XFS is trying to address is that the VFS inode lifecycle
> > > > does not cater for filesystems that need to both dirty and then
> > > > clean unlinked inodes between iput_final() and ->destroy_inode. It's
> > > > too late to be able to put the inode back on the LRU once we've
> > > > decided to drop the inode if we need to dirty it again. ANd because
> > > > evict() is part of the non-blocking memory reclaim, we aren't
> > > > supposed to block for arbitrarily long periods of time or create
> > > > unbound memory demand processing inode eviction (both of which XFS
> > > > can do in inactivation).
> > > > 
> > > > IOWs, XFS can't free the inode until it's journal releases the
> > > > internal reference on the dirty inode. ext4 doesn't track inodes in
> > > > it's journal - it only tracks inode buffers that contain the changes
> > > > made to the inode, so once the transaction is committed in
> > > > ext4_evict_inode() the inode can be immediately freed via either
> > > > ->destroy_inode or ->free_inode. That option does not exist for XFS
> > > > because we have to wait for the journal to finish with the inode
> > > > before it can be freed. Hence all the background reclaim stuff.
> > > > 
> > > > We've recently solved several of the problems we need to solve to
> > > > reduce the mismatch; avoiding blocking on inode writeback in reclaim
> > > > and background inactivation are two of the major pieces of work we
> > > > needed done before we could even consider more closely aligning XFS
> > > > to the VFS inode cache life cycle model.
> > > > 
> > > 
> > > The background inactivation work facilitates an incremental improvement
> > > by nature because destroyed inodes go directly to a queue instead of
> > > being processed synchronously. My most recent test to stamp the grace
> > > period info at inode destroy time and conditionally sync at reuse time
> > > shows pretty much no major cost because the common case is that a grace
> > > period has already expired by the time the queue populates, is processed
> > > and said inodes become reclaimable and reallocated.
> > 
> > Yup. Remember that I suggested these conditional variants in the
> > first place - I do understand what this code does...
> > 
> > > To go beyond just
> > > the performance result, if I open code the conditional sync for tracking
> > > purposes I only see something like 10-15 rcu waits out of the 36k
> > > allocation cycles. If I increase the background workload 4x, the
> > > allocation rate drops to ~33k cycles (which is still pretty much in line
> > > with baseline) and the rcu sync count increases to 70, which again is
> > > relatively nominal over tens of thousands of cycles.
> > 
> > Yup. But that doesn't mean that the calls that trigger are free from
> > impact. The cost and latency of waiting for an RCU grace period to
> > expire goes up as the CPU count goes up. e.g. it requires every CPU
> > running a task goes through a context switch before it returns.
> > Hence if we end up with situations like, say, the ioend completion
> > scheduling holdoffs, then that will prevent the RCU sync from
> > returning for seconds.
> > 
> 
> Sure... this is part of the reason the tests I've run so far have all
> tried to incorporate background rcuwalk activity, run on a higher cpu
> count box, etc. And from the XFS side of the coin, the iget code can
> invoke xfs_inodegc_queue_all() in the needs_inactive case before
> reclaimable state is a possibility, which queues a work on every cpu
> with pending inactive inodes. That is probably unlikely in the
> free/alloc case (since needs_inactive inodes are not yet free on disk),
> but the broader points are that the inactive processing work has to
> complete one way or another before reclaimable state is possible and
> that we can probably accommodate a synchronization point here if it's
> reasonably filtered. Otherwise...
> 
> > IOWs, we're effectively adding unpredictable and non-deterministic
> > latency into the recycle path that is visible to userspace
> > applications, and those latencies can be caused by subsystem
> > functionality not related to XFS. Hence we need to carefully
> > consider unexpected side-effects of adding a kernel global
> > synchronisation point into a XFS icache lookup fast path, and these
> > issues may not be immediately obvious from testing...
> > 
> 
> ... agreed. I think at this point we've also discussed various potential
> ways to shift or minimize latency/cost further, so there's probably
> still room for refinement if such unexpected things crop up before...
> 
> > > This all requires some more thorough testing, but I'm sure it won't be
> > > absolutely free for every possible workload or environment. But given
> > > that we know this infrastructure is fundamentally broken (by subtle
> > > compatibilities between XFS and the VFS that have evolved over time),
> > > will require some thought and time to fix properly in the filesystem,
> > > that users are running into problems very closely related to it, why not
> > > try to address the fundamental breakage if we can do so with an isolated
> > > change with minimal (but probably not zero) performance impact?
> > > 
> > > I agree that the unconditional synchronize_rcu() on reuse approach is
> > > just not viable, but so far tests using cond_synchronize_rcu() seem
> > > fairly reasonable. Is there some other problem or concern with such an
> > > approach?
> > 
> > Just that the impact of adding RCU sync points means that bad
> > behaviour outside XFS have a new point where they can adversely
> > impact on applications doing filesystem operations.
> > 
> > As a temporary mitigation strategy I think it will probably be fine,
> > but I'd much prefer we get rid of the need for such an RCU sync
> > point rather than try to maintain a mitigation like this in fast
> > path code forever.
> > 
> 
> ... we end up here. All in all, this is intended to be a
> practical/temporary step toward functional correctness that minimizes
> performance impact and disruption (i.e. just as easy to remove when made
> unnecessary).
> 
> Al,
> 
> The caveat to this is I think the practicality of a conditional sync in
> the iget recycle code sort of depends on the queueing/batching nature of
> inactive inode processing in XFS. If you look at xfs_fs_destroy_inode()
> for example, you'll see this is all fairly recent feature/infrastructure
> code and that historically we completed most of this transition to
> reclaimable state before ->destroy_inode() returns. Therefore, the
> concern I have is that on older/stable kernels (where users are hitting
> this NULL ->get_link() BUG) the reuse code is far more likely to stall
> and slow down here with this change alone (see my earlier numbers on the
> unconditional sync_rcu() test for prospective worst case). For that
> reason, I'm not sure this is really a backportable solution.
> 
> So getting back to your concern around Ian's patch being a
> stopgap/bandaid type solution, would you be willing to pull something
> like Ian's patch to the vfs if upstream XFS adopts this conditional rcu
> sync in the iget reuse code? I think that would ensure that no further
> bandaid fixes will be required in the vfs due to XFS inode reuse, but
> would also provide an isolated variant of the fix in the VFS that is
> more easily backported to stable kernels. Thoughts?
> 
> Brian
> 

Oh and I meant to throw up a diff of what I was testing for reference.
My test code was significantly more hacked up with debug code and such,
but if I clean it up a bit the change reduces to the diff below.

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d019c98eb839..8a24ced4d73a 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -349,6 +349,10 @@ xfs_iget_recycle(
 	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 
+	ASSERT(ip->i_destroy_gp != 0);
+	cond_synchronize_rcu(ip->i_destroy_gp);
+	ip->i_destroy_gp = 0;
+
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 	error = xfs_reinit_inode(mp, inode);
 	if (error) {
@@ -2019,6 +2023,7 @@ xfs_inodegc_queue(
 	trace_xfs_inode_set_need_inactive(ip);
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags |= XFS_NEED_INACTIVE;
+	ip->i_destroy_gp = start_poll_synchronize_rcu();
 	spin_unlock(&ip->i_flags_lock);
 
 	gc = get_cpu_ptr(mp->m_inodegc);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index c447bf04205a..a978da2332a0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -75,6 +75,8 @@ typedef struct xfs_inode {
 	spinlock_t		i_ioend_lock;
 	struct work_struct	i_ioend_work;
 	struct list_head	i_ioend_list;
+
+	unsigned long		i_destroy_gp;
 } xfs_inode_t;
 
 /* Convert from vfs inode to xfs inode */

vfs: check dentry is still valid in get_link()

Commit Message

Comments

Patch