
[v4] fuse: add new function to invalidate cache for all inodes

Message ID 20250211092604.15160-1-luis@igalia.com
State New
Series [v4] fuse: add new function to invalidate cache for all inodes

Commit Message

Luis Henriques Feb. 11, 2025, 9:26 a.m. UTC
Currently userspace is able to notify the kernel to invalidate the cache
for an inode.  This means that, if all the inodes in a filesystem need to
be invalidated, then userspace needs to iterate through all of them and do
this kernel notification separately.

This patch adds a new option that allows userspace to invalidate all the
inodes with a single notification operation.  In addition to invalidating
all the inodes, it also shrinks the sb dcache.
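
For illustration, a server could trigger the new operation by writing the
notification directly to /dev/fuse -- a hedged sketch, not part of this
patch (in the FUSE protocol a zero 'unique' marks a notification and the
'error' field carries the notify code):

	#include <errno.h>
	#include <sys/uio.h>
	#include <linux/fuse.h>	/* patched: defines FUSE_INVAL_ALL_INODES */

	static int notify_inval_all(int fuse_fd)
	{
		struct fuse_notify_inval_inode_out arg = {
			.ino = FUSE_INVAL_ALL_INODES,	/* 0: all inodes */
		};
		struct fuse_out_header out = {
			.len    = sizeof(out) + sizeof(arg),
			.error  = FUSE_NOTIFY_INVAL_INODE,
			.unique = 0,	/* notification, not a request reply */
		};
		struct iovec iov[2] = {
			{ .iov_base = &out, .iov_len = sizeof(out) },
			{ .iov_base = &arg, .iov_len = sizeof(arg) },
		};

		return (writev(fuse_fd, iov, 2) < 0) ? -errno : 0;
	}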

Signed-off-by: Luis Henriques <luis@igalia.com>
---
* Changes since v3
- Added comments to clarify semantic changes in fuse_reverse_inval_inode()
  when called with FUSE_INVAL_ALL_INODES (suggested by Bernd).
- Added comments to inodes iteration loop to clarify __iget/iput usage
  (suggested by Joanne)
- Dropped get_fuse_mount() call -- fuse_mount can be obtained from
  fuse_ilookup() directly (suggested by Joanne)

(Also dropped the RFC from the subject.)

* Changes since v2
- Use the new helper from fuse_reverse_inval_inode(), as suggested by Bernd.
- Also updated patch description as per checkpatch.pl suggestion.

* Changes since v1
As suggested by Bernd, this patch v2 simply adds a helper function that
will make it easier to replace most of its code with a call to
super_iter_inodes() when Dave Chinner's patch[1] eventually gets merged.

[1] https://lore.kernel.org/r/20241002014017.3801899-3-david@fromorbit.com

 fs/fuse/inode.c           | 83 +++++++++++++++++++++++++++++++++++----
 include/uapi/linux/fuse.h |  3 ++
 2 files changed, 79 insertions(+), 7 deletions(-)

Comments

Dave Chinner Feb. 11, 2025, 8:56 p.m. UTC | #1
[ FWIW: if the commit message directly references someone else's
related (and somewhat relevant) work, please directly CC those
people on the patch(set). I only noticed this by chance, not because
I read every FUSE related patch that goes by me. ]

On Tue, Feb 11, 2025 at 09:26:04AM +0000, Luis Henriques wrote:
> Currently userspace is able to notify the kernel to invalidate the cache
> for an inode.  This means that, if all the inodes in a filesystem need to
> be invalidated, then userspace needs to iterate through all of them and do
> this kernel notification separately.
> 
> This patch adds a new option that allows userspace to invalidate all the
> inodes with a single notification operation.  In addition to invalidating
> all the inodes, it also shrinks the sb dcache.

That, IMO, seems a bit naive - we generally don't allow user
controlled denial of service vectors to be added to the kernel. i.e.
this is the equivalent of allowing FUSE fs specific 'echo 1 >
/proc/sys/vm/drop_caches' via some fuse specific UAPI. We only allow
root access to /proc/sys/vm/drop_caches because it can otherwise be
easily abused to cause system wide performance issues.

It also strikes me as a somewhat dangerous precedent - invalidating
random VFS caches through user APIs hidden deep in random fs
implementations makes for poor visibility and difficult maintenance
of VFS level functionality...

> Signed-off-by: Luis Henriques <luis@igalia.com>
> ---
> * Changes since v3
> - Added comments to clarify semantic changes in fuse_reverse_inval_inode()
>   when called with FUSE_INVAL_ALL_INODES (suggested by Bernd).
> - Added comments to inodes iteration loop to clarify __iget/iput usage
>   (suggested by Joanne)
> - Dropped get_fuse_mount() call -- fuse_mount can be obtained from
>   fuse_ilookup() directly (suggested by Joanne)
> 
> (Also dropped the RFC from the subject.)
> 
> * Changes since v2
> - Use the new helper from fuse_reverse_inval_inode(), as suggested by Bernd.
> - Also updated patch description as per checkpatch.pl suggestion.
> 
> * Changes since v1
> As suggested by Bernd, this patch v2 simply adds a helper function that
> will make it easier to replace most of its code with a call to
> super_iter_inodes() when Dave Chinner's patch[1] eventually gets merged.
> 
> [1] https://lore.kernel.org/r/20241002014017.3801899-3-david@fromorbit.com

That doesn't make the functionality any more palatable.

Those iterators are the first step in removing the VFS inode list
and only maintaining it in filesystems that actually need this
functionality. We want this list to go away because maintaining it
is a general VFS cache scalability limitation.

i.e. if a filesystem has internal functionality that requires
iterating all instantiated inodes, the filesystem itself should
maintain that list in the most efficient manner for the filesystem's
iteration requirements, not rely on the VFS to maintain this
information for it.
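
(A purely hypothetical sketch of what that might look like for FUSE --
none of these fields or helpers exist today:

	struct fuse_conn {
		/* ... existing fields ... */
		spinlock_t	 inode_list_lock;
		struct list_head inode_list;	/* all fuse_inodes of this conn */
	};

	/* hypothetical: called wherever a fuse inode is instantiated */
	static void fuse_conn_track_inode(struct fuse_conn *fc,
					  struct fuse_inode *fi)
	{
		spin_lock(&fc->inode_list_lock);
		list_add(&fi->conn_entry, &fc->inode_list);	/* new member */
		spin_unlock(&fc->inode_list_lock);
	}

An invalidate-all could then walk fc->inode_list without ever touching the
VFS sb->s_inodes list.)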

That's the point of the iterator methods the above patchset adds -
it allows the filesystem to provide the VFS with a method for
iterating all inodes in the filesystem during the transition period
where we rework the inode cache structure (i.e. per-sb hash tables
for inode lookup, inode LRU caching goes away, etc). Once that
rework gets done, there won't be a VFS inode cache to iterate.....

>  fs/fuse/inode.c           | 83 +++++++++++++++++++++++++++++++++++----
>  include/uapi/linux/fuse.h |  3 ++
>  2 files changed, 79 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index e9db2cb8c150..5aa49856731a 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -547,25 +547,94 @@ struct inode *fuse_ilookup(struct fuse_conn *fc, u64 nodeid,
>  	return NULL;
>  }
>  
> +static void inval_single_inode(struct inode *inode, struct fuse_conn *fc)
> +{
> +	struct fuse_inode *fi;
> +
> +	fi = get_fuse_inode(inode);
> +	spin_lock(&fi->lock);
> +	fi->attr_version = atomic64_inc_return(&fc->attr_version);
> +	spin_unlock(&fi->lock);
> +	fuse_invalidate_attr(inode);
> +	forget_all_cached_acls(inode);
> +}
> +
> +static int fuse_reverse_inval_all(struct fuse_conn *fc)
> +{
> +	struct fuse_mount *fm;
> +	struct super_block *sb;
> +	struct inode *inode, *old_inode = NULL;
> +
> +	inode = fuse_ilookup(fc, FUSE_ROOT_ID, &fm);
> +	if (!inode || !fm)
> +		return -ENOENT;
> +
> +	iput(inode);
> +	sb = fm->sb;
> +
> +	spin_lock(&sb->s_inode_list_lock);
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		spin_lock(&inode->i_lock);
> +		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> +		    !atomic_read(&inode->i_count)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}

This skips every inode that is unreferenced and cached on the
LRU. i.e. it only invalidates inodes that have a current reference
(e.g. dentry pins it, has an open file, etc).

What's the point of only invalidating actively referenced inodes?

> +		/*
> +		 * This __iget()/iput() dance is required so that we can release
> +		 * the sb lock and continue the iteration on the previous
> +		 * inode.  If we don't keep a ref to the old inode it could
> +		 * disappear.  This way we can safely call cond_resched() when
> +		 * there's a huge amount of inodes to iterate.
> +		 */

If there's a huge amount of inodes to iterate, then most of them are
going to be on the LRU and unreferenced, so this code won't even get
here to be able to run cond_resched().

> +		__iget(inode);
> +		spin_unlock(&inode->i_lock);
> +		spin_unlock(&sb->s_inode_list_lock);
> +		iput(old_inode);
> +
> +		inval_single_inode(inode, fc);
> +
> +		old_inode = inode;
> +		cond_resched();
> +		spin_lock(&sb->s_inode_list_lock);
> +	}
> +	spin_unlock(&sb->s_inode_list_lock);
> +	iput(old_inode);
> +
> +	shrink_dcache_sb(sb);

Why drop all the referenced inodes held by the dentry cache -after-
inode invalidation? Doesn't this mean that racing operations are
going to see valid dentries backed by an invalidated inode?  Why
aren't the dentries pruned from the cache first, and new lookups
blocked until the invalidation completes?

I'm left to ponder why the invalidation isn't simply:

	/* Remove all possible active references to cached inodes */
	shrink_dcache_sb();

	/* Remove all unreferenced inodes from cache */
	invalidate_inodes();

Which will result in far more of the inode cache for the filesystem
being invalidated than the above code....
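
Spelled out, that would be roughly the following sketch (untested, and
note invalidate_inodes() is declared in fs/internal.h, so fs/fuse can't
actually call it today without extra VFS plumbing):

	static int fuse_reverse_inval_all(struct fuse_conn *fc)
	{
		struct fuse_mount *fm;
		struct inode *inode;

		inode = fuse_ilookup(fc, FUSE_ROOT_ID, &fm);
		if (!inode || !fm)
			return -ENOENT;
		iput(inode);

		/* Remove all possible active references to cached inodes */
		shrink_dcache_sb(fm->sb);
		/* Remove all unreferenced inodes from cache */
		invalidate_inodes(fm->sb);

		return 0;
	}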

-Dave.
Luis Henriques Feb. 12, 2025, 11:32 a.m. UTC | #2
On Wed, Feb 12 2025, Dave Chinner wrote:

> [ FWIW: if the commit message directly references someone else's
> related (and somewhat relevant) work, please directly CC those
> people on the patch(set). I only noticed this by chance, not because
> I read every FUSE related patch that goes by me. ]

Point taken -- I should have included you on CC since the initial RFC.

> On Tue, Feb 11, 2025 at 09:26:04AM +0000, Luis Henriques wrote:
>> Currently userspace is able to notify the kernel to invalidate the cache
>> for an inode.  This means that, if all the inodes in a filesystem need to
>> be invalidated, then userspace needs to iterate through all of them and do
>> this kernel notification separately.
>> 
>> This patch adds a new option that allows userspace to invalidate all the
>> inodes with a single notification operation.  In addition to invalidating
>> all the inodes, it also shrinks the sb dcache.
>
> That, IMO, seems a bit naive - we generally don't allow user
> controlled denial of service vectors to be added to the kernel. i.e.
> this is the equivalent of allowing FUSE fs specific 'echo 1 >
> /proc/sys/vm/drop_caches' via some fuse specific UAPI. We only allow
> root access to /proc/sys/vm/drop_caches because it can otherwise be
> easily abused to cause system wide performance issues.
>
> It also strikes me as a somewhat dangerous precedent - invalidating
> random VFS caches through user APIs hidden deep in random fs
> implementations makes for poor visibility and difficult maintenance
> of VFS level functionality...

Hmm... OK, I understand the concern and your comment makes perfect sense.
But would it be acceptable to move this API higher up the stack and make it
visible at the VFS layer?  Something similar to 'drop_caches' but with
superblock granularity.  I haven't spent any time thinking about how that
could be done, but it wouldn't be "hidden deep" anymore.

>> Signed-off-by: Luis Henriques <luis@igalia.com>
>> ---
>> * Changes since v3
>> - Added comments to clarify semantic changes in fuse_reverse_inval_inode()
>>   when called with FUSE_INVAL_ALL_INODES (suggested by Bernd).
>> - Added comments to inodes iteration loop to clarify __iget/iput usage
>>   (suggested by Joanne)
>> - Dropped get_fuse_mount() call -- fuse_mount can be obtained from
>>   fuse_ilookup() directly (suggested by Joanne)
>> 
>> (Also dropped the RFC from the subject.)
>> 
>> * Changes since v2
>> - Use the new helper from fuse_reverse_inval_inode(), as suggested by Bernd.
>> - Also updated patch description as per checkpatch.pl suggestion.
>> 
>> * Changes since v1
>> As suggested by Bernd, this patch v2 simply adds a helper function that
>> will make it easier to replace most of its code with a call to
>> super_iter_inodes() when Dave Chinner's patch[1] eventually gets merged.
>> 
>> [1] https://lore.kernel.org/r/20241002014017.3801899-3-david@fromorbit.com
>
> That doesn't make the functionality any more palatable.
>
> Those iterators are the first step in removing the VFS inode list
> and only maintaining it in filesystems that actually need this
> functionality. We want this list to go away because maintaining it
> is a general VFS cache scalability limitation.
>
> i.e. if a filesystem has internal functionality that requires
> iterating all instantiated inodes, the filesystem itself should
> maintain that list in the most efficient manner for the filesystem's
> iteration requirements, not rely on the VFS to maintain this
> information for it.

Right, and in my use-case that's exactly what is currently being done: the
FUSE API to invalidate individual inodes is being used.  This new
functionality just tries to make life easier for userspace when *all* the
inodes need to be invalidated. (For reference, the use-case is CVMFS, a
read-only FS, where new generations of a filesystem snapshot may become
available at some point and the previous one needs to be wiped from the
cache.)

> That's the point of the iterator methods the above patchset adds -
> it allows the filesystem to provide the VFS with a method for
> iterating all inodes in the filesystem during the transition period
> where we rework the inode cache structure (i.e. per-sb hash tables
> for inode lookup, inode LRU caching goes away, etc). Once that
> rework gets done, there won't be a VFS inode cache to iterate.....

And re-reading the cover-letter in that patchset helped me understand that
this is indeed the goal.  My patch mentioned that iterator because we
could eventually take advantage of it, but clearly no new users are
expected/desired.

>>  fs/fuse/inode.c           | 83 +++++++++++++++++++++++++++++++++++----
>>  include/uapi/linux/fuse.h |  3 ++
>>  2 files changed, 79 insertions(+), 7 deletions(-)
>> 
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index e9db2cb8c150..5aa49856731a 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -547,25 +547,94 @@ struct inode *fuse_ilookup(struct fuse_conn *fc, u64 nodeid,
>>  	return NULL;
>>  }
>>  
>> +static void inval_single_inode(struct inode *inode, struct fuse_conn *fc)
>> +{
>> +	struct fuse_inode *fi;
>> +
>> +	fi = get_fuse_inode(inode);
>> +	spin_lock(&fi->lock);
>> +	fi->attr_version = atomic64_inc_return(&fc->attr_version);
>> +	spin_unlock(&fi->lock);
>> +	fuse_invalidate_attr(inode);
>> +	forget_all_cached_acls(inode);
>> +}
>> +
>> +static int fuse_reverse_inval_all(struct fuse_conn *fc)
>> +{
>> +	struct fuse_mount *fm;
>> +	struct super_block *sb;
>> +	struct inode *inode, *old_inode = NULL;
>> +
>> +	inode = fuse_ilookup(fc, FUSE_ROOT_ID, &fm);
>> +	if (!inode || !fm)
>> +		return -ENOENT;
>> +
>> +	iput(inode);
>> +	sb = fm->sb;
>> +
>> +	spin_lock(&sb->s_inode_list_lock);
>> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>> +		spin_lock(&inode->i_lock);
>> +		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
>> +		    !atomic_read(&inode->i_count)) {
>> +			spin_unlock(&inode->i_lock);
>> +			continue;
>> +		}
>
> This skips every inode that is unreferenced and cached on the
> LRU. i.e. it only invalidates inodes that have a current reference
> (e.g. dentry pins it, has an open file, etc).
>
> What's the point of only invalidating actively referenced inodes?
>
>> +		/*
>> +		 * This __iget()/iput() dance is required so that we can release
>> +		 * the sb lock and continue the iteration on the previous
>> +		 * inode.  If we don't keep a ref to the old inode it could
>> +		 * disappear.  This way we can safely call cond_resched() when
>> +		 * there's a huge amount of inodes to iterate.
>> +		 */
>
> If there's a huge amount of inodes to iterate, then most of them are
> going to be on the LRU and unreferenced, so this code won't even get
> here to be able to run cond_resched().

Sigh.  Looks like I got this wrong.  With a capital 'W'.

>> +		__iget(inode);
>> +		spin_unlock(&inode->i_lock);
>> +		spin_unlock(&sb->s_inode_list_lock);
>> +		iput(old_inode);
>> +
>> +		inval_single_inode(inode, fc);
>> +
>> +		old_inode = inode;
>> +		cond_resched();
>> +		spin_lock(&sb->s_inode_list_lock);
>> +	}
>> +	spin_unlock(&sb->s_inode_list_lock);
>> +	iput(old_inode);
>> +
>> +	shrink_dcache_sb(sb);
>
> Why drop all the referenced inodes held by the dentry cache -after-
> inode invalidation? Doesn't this mean that racing operations are
> going to see valid dentries backed by an invalidated inode?  Why
> aren't the dentries pruned from the cache first, and new lookups
> blocked until the invalidation completes?
>
> I'm left to ponder why the invalidation isn't simply:
>
> 	/* Remove all possible active references to cached inodes */
> 	shrink_dcache_sb();
>
> 	/* Remove all unreferenced inodes from cache */
> 	invalidate_inodes();
>
> Which will result in far more of the inode cache for the filesystem
> being invalidated than the above code....

To be honest, my initial attempt to implement this feature actually used
invalidate_inodes().  For some reason I don't remember anymore, I decided
to implement the iterator myself.  I'll go look at that code again and run
some tests on this (much!) simplified version of the invalidation function
you're suggesting.

Also, thanks a lot for your comments, Dave.  Much appreciated!  I'll make
sure to CC you on any future revision of this patch ;-)

Cheers,
Dave Chinner Feb. 12, 2025, 10:05 p.m. UTC | #3
On Wed, Feb 12, 2025 at 11:32:40AM +0000, Luis Henriques wrote:
> On Wed, Feb 12 2025, Dave Chinner wrote:
> 
> > [ FWIW: if the commit message directly references someone else's
> > related (and somewhat relevant) work, please directly CC those
> > people on the patch(set). I only noticed this by chance, not because
> > I read every FUSE related patch that goes by me. ]
> 
> Point taken -- I should have included you on CC since the initial RFC.
> 
> > On Tue, Feb 11, 2025 at 09:26:04AM +0000, Luis Henriques wrote:
> >> Currently userspace is able to notify the kernel to invalidate the cache
> >> for an inode.  This means that, if all the inodes in a filesystem need to
> >> be invalidated, then userspace needs to iterate through all of them and do
> >> this kernel notification separately.
> >> 
> >> This patch adds a new option that allows userspace to invalidate all the
> >> inodes with a single notification operation.  In addition to invalidating
> >> all the inodes, it also shrinks the sb dcache.
> >
> > That, IMO, seems a bit naive - we generally don't allow user
> > controlled denial of service vectors to be added to the kernel. i.e.
> > this is the equivalent of allowing FUSE fs specific 'echo 1 >
> > /proc/sys/vm/drop_caches' via some fuse specific UAPI. We only allow
> > root access to /proc/sys/vm/drop_caches because it can otherwise be
> > easily abused to cause system wide performance issues.
> >
> > It also strikes me as a somewhat dangerous precedent - invalidating
> > random VFS caches through user APIs hidden deep in random fs
> > implementations makes for poor visibility and difficult maintenance
> > of VFS level functionality...
> 
> Hmm... OK, I understand the concern and your comment makes perfect sense.
> But would it be acceptable to move this API higher up the stack and make it
> visible at the VFS layer?  Something similar to 'drop_caches' but with
> superblock granularity.  I haven't spent any time thinking about how that
> could be done, but it wouldn't be "hidden deep" anymore.

I'm yet to see any justification for why 'user driven entire
filesystem cache invalidation' is needed. Get agreement on whether
the functionality should exist first, then worry about how to
implement it.

> >> Signed-off-by: Luis Henriques <luis@igalia.com>
> >> ---
> >> * Changes since v3
> >> - Added comments to clarify semantic changes in fuse_reverse_inval_inode()
> >>   when called with FUSE_INVAL_ALL_INODES (suggested by Bernd).
> >> - Added comments to inodes iteration loop to clarify __iget/iput usage
> >>   (suggested by Joanne)
> >> - Dropped get_fuse_mount() call -- fuse_mount can be obtained from
> >>   fuse_ilookup() directly (suggested by Joanne)
> >> 
> >> (Also dropped the RFC from the subject.)
> >> 
> >> * Changes since v2
> >> - Use the new helper from fuse_reverse_inval_inode(), as suggested by Bernd.
> >> - Also updated patch description as per checkpatch.pl suggestion.
> >> 
> >> * Changes since v1
> >> As suggested by Bernd, this patch v2 simply adds a helper function that
> >> will make it easier to replace most of its code with a call to
> >> super_iter_inodes() when Dave Chinner's patch[1] eventually gets merged.
> >> 
> >> [1] https://lore.kernel.org/r/20241002014017.3801899-3-david@fromorbit.com
> >
> > That doesn't make the functionality any more palatable.
> >
> > Those iterators are the first step in removing the VFS inode list
> > and only maintaining it in filesystems that actually need this
> > functionality. We want this list to go away because maintaining it
> > is a general VFS cache scalability limitation.
> >
> > i.e. if a filesystem has internal functionality that requires
> > iterating all instantiated inodes, the filesystem itself should
> > maintain that list in the most efficient manner for the filesystem's
> > iteration requirements, not rely on the VFS to maintain this
> > information for it.
> 
> Right, and in my use-case that's exactly what is currently being done: the
> FUSE API to invalidate individual inodes is being used.
>
> This new
> functionality just tries to make life easier for userspace when *all* the
> inodes need to be invalidated. (For reference, the use-case is CVMFS, a
> read-only FS, where new generations of a filesystem snapshot may become
> available at some point and the previous one needs to be wiped from the
> cache.)

But you can't actually "wipe" referenced inodes from cache. That is a
use case for revoke(), not inode cache invalidation.

> > I'm left to ponder why the invalidation isn't simply:
> >
> > 	/* Remove all possible active references to cached inodes */
> > 	shrink_dcache_sb();
> >
> > 	/* Remove all unreferenced inodes from cache */
> > 	invalidate_inodes();
> >
> > Which will result in far more of the inode cache for the filesystem
> > being invalidated than the above code....
> 
> To be honest, my initial attempt to implement this feature actually used
> invalidate_inodes().  For some reason I don't remember anymore, I decided
> to implement the iterator myself.  I'll go look at that code again and run
> some tests on this (much!) simplified version of the invalidation function
> you're suggesting.

The above code, while simpler, still doesn't resolve the problem of
invalidation of inodes that have active references (e.g. open files
on them). They can't be "invalidated" in this way - they can't be
removed from cache until all active references go away.

i.e. any operation that is based on the assumption that we can
remove all references to inodes and dentries by walking across them
and dropping cache references to them is fundamentally flawed. To do
this reliably, all active references have to be hunted down and
released before the inodes can be removed from VFS visibility. i.e.
the mythical revoke() operation would need to be implemented for
this to work...

-Dave.
Luis Henriques Feb. 13, 2025, 11:08 a.m. UTC | #4
On Thu, Feb 13 2025, Dave Chinner wrote:

> On Wed, Feb 12, 2025 at 11:32:40AM +0000, Luis Henriques wrote:
>> On Wed, Feb 12 2025, Dave Chinner wrote:
>> 
>> > [ FWIW: if the commit message directly references someone else's
>> > related (and somewhat relevant) work, please directly CC those
>> > people on the patch(set). I only noticed this by chance, not because
>> > I read every FUSE related patch that goes by me. ]
>> 
>> Point taken -- I should have included you on CC since the initial RFC.
>> 
>> > On Tue, Feb 11, 2025 at 09:26:04AM +0000, Luis Henriques wrote:
>> >> Currently userspace is able to notify the kernel to invalidate the cache
>> >> for an inode.  This means that, if all the inodes in a filesystem need to
>> >> be invalidated, then userspace needs to iterate through all of them and do
>> >> this kernel notification separately.
>> >> 
>> >> This patch adds a new option that allows userspace to invalidate all the
>> >> inodes with a single notification operation.  In addition to invalidating
>> >> all the inodes, it also shrinks the sb dcache.
>> >
>> > That, IMO, seems a bit naive - we generally don't allow user
>> > controlled denial of service vectors to be added to the kernel. i.e.
>> > this is the equivalent of allowing FUSE fs specific 'echo 1 >
>> > /proc/sys/vm/drop_caches' via some fuse specific UAPI. We only allow
>> > root access to /proc/sys/vm/drop_caches because it can otherwise be
>> > easily abused to cause system wide performance issues.
>> >
>> > It also strikes me as a somewhat dangerous precedent - invalidating
>> > random VFS caches through user APIs hidden deep in random fs
>> > implementations makes for poor visibility and difficult maintenance
>> > of VFS level functionality...
>> 
>> Hmm... OK, I understand the concern and your comment makes perfect sense.
>> But would it be acceptable to move this API higher up the stack and make it
>> visible at the VFS layer?  Something similar to 'drop_caches' but with
>> superblock granularity.  I haven't spent any time thinking about how that
>> could be done, but it wouldn't be "hidden deep" anymore.
>
> I'm yet to see any justification for why 'user driven entire
> filesystem cache invalidation' is needed. Get agreement on whether
> the functionality should exist first, then worry about how to
> implement it.
>
>> >> Signed-off-by: Luis Henriques <luis@igalia.com>
>> >> ---
>> >> * Changes since v3
>> >> - Added comments to clarify semantic changes in fuse_reverse_inval_inode()
>> >>   when called with FUSE_INVAL_ALL_INODES (suggested by Bernd).
>> >> - Added comments to inodes iteration loop to clarify __iget/iput usage
>> >>   (suggested by Joanne)
>> >> - Dropped get_fuse_mount() call -- fuse_mount can be obtained from
>> >>   fuse_ilookup() directly (suggested by Joanne)
>> >> 
>> >> (Also dropped the RFC from the subject.)
>> >> 
>> >> * Changes since v2
>> >> - Use the new helper from fuse_reverse_inval_inode(), as suggested by Bernd.
>> >> - Also updated patch description as per checkpatch.pl suggestion.
>> >> 
>> >> * Changes since v1
>> >> As suggested by Bernd, this patch v2 simply adds a helper function that
>> >> will make it easier to replace most of its code with a call to
>> >> super_iter_inodes() when Dave Chinner's patch[1] eventually gets merged.
>> >> 
>> >> [1] https://lore.kernel.org/r/20241002014017.3801899-3-david@fromorbit.com
>> >
>> > That doesn't make the functionality any more palatable.
>> >
>> > Those iterators are the first step in removing the VFS inode list
>> > and only maintaining it in filesystems that actually need this
>> > functionality. We want this list to go away because maintaining it
>> > is a general VFS cache scalability limitation.
>> >
>> > i.e. if a filesystem has internal functionality that requires
>> > iterating all instantiated inodes, the filesystem itself should
>> > maintain that list in the most efficient manner for the filesystem's
>> > iteration requirements, not rely on the VFS to maintain this
>> > information for it.
>> 
>> Right, and in my use-case that's exactly what is currently being done: the
>> FUSE API to invalidate individual inodes is being used.
>>
>> This new
>> functionality just tries to make life easier for userspace when *all* the
>> inodes need to be invalidated. (For reference, the use-case is CVMFS, a
>> read-only FS, where new generations of a filesystem snapshot may become
>> available at some point and the previous one needs to be wiped from the
>> cache.)
>
> But you can't actually "wipe" referenced inodes from cache. That is a
> use case for revoke(), not inode cache invalidation.

I guess the word "wipe" wasn't the best choice.  See below.

>> > I'm left to ponder why the invalidation isn't simply:
>> >
>> > 	/* Remove all possible active references to cached inodes */
>> > 	shrink_dcache_sb();
>> >
>> > 	/* Remove all unreferenced inodes from cache */
>> > 	invalidate_inodes();
>> >
>> > Which will result in far more of the inode cache for the filesystem
>> > being invalidated than the above code....
>> 
>> To be honest, my initial attempt to implement this feature actually used
>> invalidate_inodes().  For some reason I don't remember anymore, I decided
>> to implement the iterator myself.  I'll go look at that code again and run
>> some tests on this (much!) simplified version of the invalidation function
>> you're suggesting.
>
> The above code, while simpler, still doesn't resolve the problem of
> invalidation of inodes that have active references (e.g. open files
> on them). They can't be "invalidated" in this way - they can't be
> removed from cache until all active references go away.

Sure, I understand that and that's *not* what I'm trying to do.  I guess
I'm just failing to describe my goal.  If there's a userspace process that
has a file open for an inode that does not exist anymore, that process
will continue using it -- the user-space filesystem will have to deal with
that.

Right now, fuse allows the user-space filesystem to notify the kernel that
*one* inode is not valid anymore.  This is a per-inode operation.  I guess
this is very useful, for example, for network filesystems, where an inode
may have been deleted from somewhere else.

However, when user-space wants to do this for all the filesystem's inodes,
it will be slow.  With my patch, all I wanted to do was make it a bit less
painful by moving the inode iteration into the kernel.

Cheers,

Patch

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e9db2cb8c150..5aa49856731a 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -547,25 +547,94 @@  struct inode *fuse_ilookup(struct fuse_conn *fc, u64 nodeid,
 	return NULL;
 }
 
+static void inval_single_inode(struct inode *inode, struct fuse_conn *fc)
+{
+	struct fuse_inode *fi;
+
+	fi = get_fuse_inode(inode);
+	spin_lock(&fi->lock);
+	fi->attr_version = atomic64_inc_return(&fc->attr_version);
+	spin_unlock(&fi->lock);
+	fuse_invalidate_attr(inode);
+	forget_all_cached_acls(inode);
+}
+
+static int fuse_reverse_inval_all(struct fuse_conn *fc)
+{
+	struct fuse_mount *fm;
+	struct super_block *sb;
+	struct inode *inode, *old_inode = NULL;
+
+	inode = fuse_ilookup(fc, FUSE_ROOT_ID, &fm);
+	if (!inode || !fm)
+		return -ENOENT;
+
+	iput(inode);
+	sb = fm->sb;
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    !atomic_read(&inode->i_count)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+
+		/*
+		 * This __iget()/iput() dance is required so that we can release
+		 * the sb lock and continue the iteration on the previous
+		 * inode.  If we don't keep a ref to the old inode it could
+		 * disappear.  This way we can safely call cond_resched() when
+		 * there's a huge amount of inodes to iterate.
+		 */
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+		iput(old_inode);
+
+		inval_single_inode(inode, fc);
+
+		old_inode = inode;
+		cond_resched();
+		spin_lock(&sb->s_inode_list_lock);
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+	iput(old_inode);
+
+	shrink_dcache_sb(sb);
+
+	return 0;
+}
+
+/*
+ * Notification to invalidate the inode cache.  It can be called with
+ * @nodeid set to either:
+ *
+ * - An inode number - Any pending writeback within the range [@offset, @len]
+ *   will be triggered and the inode will be invalidated.  To invalidate the
+ *   whole cache, @offset has to be set to '0' and @len needs to be <= '0';
+ *   if @offset is negative, only the inode attributes are invalidated.
+ *
+ * - FUSE_INVAL_ALL_INODES - All the inodes in the superblock are invalidated
+ *   and the whole dcache is shrunk.
+ */
 int fuse_reverse_inval_inode(struct fuse_conn *fc, u64 nodeid,
 			     loff_t offset, loff_t len)
 {
-	struct fuse_inode *fi;
 	struct inode *inode;
 	pgoff_t pg_start;
 	pgoff_t pg_end;
 
+	if (nodeid == FUSE_INVAL_ALL_INODES)
+		return fuse_reverse_inval_all(fc);
+
 	inode = fuse_ilookup(fc, nodeid, NULL);
 	if (!inode)
 		return -ENOENT;
 
-	fi = get_fuse_inode(inode);
-	spin_lock(&fi->lock);
-	fi->attr_version = atomic64_inc_return(&fc->attr_version);
-	spin_unlock(&fi->lock);
+	inval_single_inode(inode, fc);
 
-	fuse_invalidate_attr(inode);
-	forget_all_cached_acls(inode);
 	if (offset >= 0) {
 		pg_start = offset >> PAGE_SHIFT;
 		if (len <= 0)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5e0eb41d967e..e5852b63f99f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -669,6 +669,9 @@  enum fuse_notify_code {
 	FUSE_NOTIFY_CODE_MAX,
 };
 
+/* The nodeid used to request invalidation of all inodes */
+#define FUSE_INVAL_ALL_INODES 0
+
 /* The read buffer is required to be at least 8k, but may be much larger */
 #define FUSE_MIN_READ_BUFFER 8192
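
If this notification were merged, a libfuse-based server could request the
full invalidation through the existing helper -- hypothetical usage, which
assumes libfuse passes the reserved nodeid 0 through to the kernel
unchanged:

	#include <stdio.h>
	#include <string.h>
	#include <fuse_lowlevel.h>	/* libfuse 3 */

	static void inval_all(struct fuse_session *se)
	{
		/* FUSE_INVAL_ALL_INODES (0) is the reserved nodeid added above */
		int err = fuse_lowlevel_notify_inval_inode(se,
					FUSE_INVAL_ALL_INODES, 0, 0);

		if (err)
			fprintf(stderr, "inval-all: %s\n", strerror(-err));
	}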