Message ID | 20241002014017.3801899-5-david@fromorbit.com (mailing list archive)
---|---
State | New
Series | vfs: improving inode cache iteration scalability
On Wed, Oct 02, 2024 at 11:33:21AM +1000, Dave Chinner wrote:
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -1223,109 +1223,60 @@ static void hook_inode_free_security_rcu(void *inode_security)
> 
>  /*
>   * Release the inodes used in a security policy.
> - *
> - * Cf. fsnotify_unmount_inodes() and invalidate_inodes()
>   */
> +static int release_inode_fn(struct inode *inode, void *data)

Looks like this is called from the sb_delete LSM hook, which is only
implemented by landlock, and only called from generic_shutdown_super,
separated from evict_inodes only by a call to fsnotify_sb_delete. Why did
LSM not hook into that and instead add another iteration of the per-sb
inode list?

(Note that this is not trying to get Dave to fix this, just noticed it
when reviewing this series)
On Thu, Oct 03, 2024 at 12:23:41AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 02, 2024 at 11:33:21AM +1000, Dave Chinner wrote:
> > --- a/security/landlock/fs.c
> > +++ b/security/landlock/fs.c
> > @@ -1223,109 +1223,60 @@ static void hook_inode_free_security_rcu(void *inode_security)
> > 
> >  /*
> >   * Release the inodes used in a security policy.
> > - *
> > - * Cf. fsnotify_unmount_inodes() and invalidate_inodes()
> >   */
> > +static int release_inode_fn(struct inode *inode, void *data)
> 
> Looks like this is called from the sb_delete LSM hook, which is only
> implemented by landlock, and only called from generic_shutdown_super,
> separated from evict_inodes only by a call to fsnotify_sb_delete. Why
> did LSM not hook into that and instead

And the main thing that fsnotify_sb_delete does is yet another inode
iteration...

Any chance you all could get together and figure out how to get down
to a single sb inode iteration per unmount?
On Thu 03-10-24 00:38:05, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 12:23:41AM -0700, Christoph Hellwig wrote:
> > On Wed, Oct 02, 2024 at 11:33:21AM +1000, Dave Chinner wrote:
> > > --- a/security/landlock/fs.c
> > > +++ b/security/landlock/fs.c
> > > @@ -1223,109 +1223,60 @@ static void hook_inode_free_security_rcu(void *inode_security)
> > > 
> > >  /*
> > >   * Release the inodes used in a security policy.
> > > - *
> > > - * Cf. fsnotify_unmount_inodes() and invalidate_inodes()
> > >   */
> > > +static int release_inode_fn(struct inode *inode, void *data)
> > 
> > Looks like this is called from the sb_delete LSM hook, which is only
> > implemented by landlock, and only called from generic_shutdown_super,
> > separated from evict_inodes only by a call to fsnotify_sb_delete. Why
> > did LSM not hook into that and instead
> 
> And the main thing that fsnotify_sb_delete does is yet another inode
> iteration...
> 
> Any chance you all could get together and figure out how to get down
> to a single sb inode iteration per unmount?

Fair enough. If we go with the iterator variant I've suggested to Dave in
[1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
Landlock's hook_sb_delete() into a single iteration relatively easily. But
I'd wait with that conversion until this series lands.

								Honza

[1] https://lore.kernel.org/all/20241003114555.bl34fkqsja4s5tok@quack3
On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> Fair enough. If we go with the iterator variant I've suggested to Dave in
> [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> Landlock's hook_sb_delete() into a single iteration relatively easily. But
> I'd wait with that conversion until this series lands.

I don't see how that has anything to do with iterators or not.
On Thu 03-10-24 05:11:11, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > Landlock's hook_sb_delete() into a single iteration relatively easily. But
> > I'd wait with that conversion until this series lands.
> 
> I don't see how that has anything to do with iterators or not.

Well, the patches would obviously conflict, which seems pointless if we
could live with three iterations for a few years until somebody noticed :).
And with Dave's current version of the iterators it will not be possible to
integrate the evict_inodes() iteration with the other two without a
layering violation. Still, we could go from 3 to 2 iterations.

								Honza
On Thu, Oct 03, 2024 at 02:26:57PM +0200, Jan Kara wrote:
> On Thu 03-10-24 05:11:11, Christoph Hellwig wrote:
> > On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > Landlock's hook_sb_delete() into a single iteration relatively easily. But
> > > I'd wait with that conversion until this series lands.
> > 
> > I don't see how that has anything to do with iterators or not.
> 
> Well, the patches would obviously conflict

Conflict with what?

> which seems pointless if we
> could live with three iterations for a few years until somebody noticed :).
> And with Dave's current version of the iterators it will not be possible to
> integrate the evict_inodes() iteration with the other two without a
> layering violation. Still, we could go from 3 to 2 iterations.

What layering violation?

Below is a quick compile-tested patch to do the fsnotify side and get rid
of the fsnotify iteration, which looks easily worth it.

landlock is a bit more complex because of the lsm hooks and the internal
underobj abstraction inside of landlock. But looking at the release inode
data vs unmount synchronization it has and the duplicate code, I think
doing it this way is worth the effort even more; it'll just need someone
who knows landlock and the lsm layering to help with the work.

diff --git a/fs/inode.c b/fs/inode.c
index 3f335f78c5b228..7d5f8a09e4d29d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
  */
 static int evict_inode_fn(struct inode *inode, void *data)
 {
+	struct super_block *sb = inode->i_sb;
 	struct list_head *dispose = data;
+	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
 
 	spin_lock(&inode->i_lock);
-	if (atomic_read(&inode->i_count) ||
-	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
+	if (atomic_read(&inode->i_count)) {
+		spin_unlock(&inode->i_lock);
+
+		/* for each watch, send FS_UNMOUNT and then remove it */
+		if (post_unmount && fsnotify_sb_info(sb)) {
+			fsnotify_inode(inode, FS_UNMOUNT);
+			fsnotify_inode_delete(inode);
+		}
+		return INO_ITER_DONE;
+	}
+
+	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 		spin_unlock(&inode->i_lock);
 		return INO_ITER_DONE;
 	}
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 68c34ed9427190..cf89aa69e82c8d 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -28,16 +28,6 @@ void __fsnotify_vfsmount_delete(struct vfsmount *mnt)
 	fsnotify_clear_marks_by_mount(mnt);
 }
 
-static int fsnotify_unmount_inode_fn(struct inode *inode, void *data)
-{
-	spin_unlock(&inode->i_lock);
-
-	/* for each watch, send FS_UNMOUNT and then remove it */
-	fsnotify_inode(inode, FS_UNMOUNT);
-	fsnotify_inode_delete(inode);
-	return INO_ITER_DONE;
-}
-
 void fsnotify_sb_delete(struct super_block *sb)
 {
 	struct fsnotify_sb_info *sbinfo = fsnotify_sb_info(sb);
@@ -46,19 +36,6 @@ void fsnotify_sb_delete(struct super_block *sb)
 	if (!sbinfo)
 		return;
 
-	/*
-	 * If i_count is zero, the inode cannot have any watches and
-	 * doing an __iget/iput with SB_ACTIVE clear would actually
-	 * evict all inodes with zero i_count from icache which is
-	 * unnecessarily violent and may in fact be illegal to do.
-	 * However, we should have been called /after/ evict_inodes
-	 * removed all zero refcount inodes, in any case. Hence we use
-	 * INO_ITER_REFERENCED to ensure zero refcount inodes are filtered
-	 * properly.
-	 */
-	super_iter_inodes(sb, fsnotify_unmount_inode_fn, NULL,
-			INO_ITER_REFERENCED);
-
 	fsnotify_clear_marks_by_sb(sb);
 
 	/* Wait for outstanding object references from connectors */
 	wait_var_event(fsnotify_sb_watched_objects(sb),
diff --git a/fs/super.c b/fs/super.c
index 971ad4e996e0ba..88dd1703fe73db 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -167,28 +167,17 @@ static void super_wake(struct super_block *sb, unsigned int flag)
 	wake_up_var(&sb->s_flags);
 }
 
-bool super_iter_iget(struct inode *inode, int flags)
+bool super_iter_iget(struct inode *inode)
 {
-	bool	ret = false;
+	bool ret = false;
 
 	spin_lock(&inode->i_lock);
-	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))
-		goto out_unlock;
-
-	/*
-	 * Skip over zero refcount inode if the caller only wants
-	 * referenced inodes to be iterated.
-	 */
-	if ((flags & INO_ITER_REFERENCED) &&
-	    !atomic_read(&inode->i_count))
-		goto out_unlock;
-
-	__iget(inode);
-	ret = true;
-out_unlock:
+	if (!(inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
+		__iget(inode);
+		ret = true;
+	}
 	spin_unlock(&inode->i_lock);
 	return ret;
-
 }
 EXPORT_SYMBOL_GPL(super_iter_iget);
 
@@ -216,7 +205,7 @@ int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
 
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (!super_iter_iget(inode, flags))
+		if (!super_iter_iget(inode))
 			continue;
 		spin_unlock(&sb->s_inode_list_lock);
 		iput(old_inode);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ee544556cee728..5a174e690424fb 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1654,8 +1654,7 @@ xfs_iter_vfs_igrab(
 	if (ip->i_flags & XFS_ITER_VFS_NOGRAB_IFLAGS)
 		goto out_unlock_noent;
 
-	if ((flags & INO_ITER_UNSAFE) ||
-	    super_iter_iget(inode, flags))
+	if ((flags & INO_ITER_UNSAFE) || super_iter_iget(inode))
 		ret = true;
 
 out_unlock_noent:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2aa335228b84bf..a3c682f0d94c1b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2224,7 +2224,7 @@ enum freeze_holder {
 typedef int (*ino_iter_fn)(struct inode *inode, void *priv);
 int super_iter_inodes(struct super_block *sb, ino_iter_fn iter_fn,
 		void *private_data, int flags);
-bool super_iter_iget(struct inode *inode, int flags);
+bool super_iter_iget(struct inode *inode);
 
 struct super_operations {
 	struct inode *(*alloc_inode)(struct super_block *sb);
On Thu 03-10-24 05:39:23, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 02:26:57PM +0200, Jan Kara wrote:
> > On Thu 03-10-24 05:11:11, Christoph Hellwig wrote:
> > > On Thu, Oct 03, 2024 at 01:57:21PM +0200, Jan Kara wrote:
> > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > Landlock's hook_sb_delete() into a single iteration relatively easily. But
> > > > I'd wait with that conversion until this series lands.
> > > 
> > > I don't see how that has anything to do with iterators or not.
> > 
> > Well, the patches would obviously conflict
> 
> Conflict with what?

I thought you wanted the iterations to be unified in the current state of
the code. If you meant after Dave's series, then we are in agreement.

> > which seems pointless if we
> > could live with three iterations for a few years until somebody noticed :).
> > And with Dave's current version of the iterators it will not be possible to
> > integrate the evict_inodes() iteration with the other two without a
> > layering violation. Still, we could go from 3 to 2 iterations.
> 
> What layering violation?
> 
> Below is a quick compile-tested patch to do the fsnotify side and get rid
> of the fsnotify iteration, which looks easily worth it.

...

> @@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
>   */
>  static int evict_inode_fn(struct inode *inode, void *data)
>  {
> +	struct super_block *sb = inode->i_sb;
>  	struct list_head *dispose = data;
> +	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
>  
>  	spin_lock(&inode->i_lock);
> -	if (atomic_read(&inode->i_count) ||
> -	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
> +	if (atomic_read(&inode->i_count)) {
> +		spin_unlock(&inode->i_lock);
> +
> +		/* for each watch, send FS_UNMOUNT and then remove it */
> +		if (post_unmount && fsnotify_sb_info(sb)) {
> +			fsnotify_inode(inode, FS_UNMOUNT);
> +			fsnotify_inode_delete(inode);
> +		}

This will not work because you are in the unsafe iterator holding
sb->s_inode_list_lock. To be able to call into fsnotify, you need to do the
iget / iput dance and release s_inode_list_lock, which does not work when a
filesystem has its own inodes iterator AFAICT... That's why I've called it
a layering violation.

								Honza

> +		return INO_ITER_DONE;
> +	}
> +
> +	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
>  		spin_unlock(&inode->i_lock);
>  		return INO_ITER_DONE;
>  	}
On Thu, Oct 03, 2024 at 02:56:50PM +0200, Jan Kara wrote:
> > +	if (atomic_read(&inode->i_count)) {
> > +		spin_unlock(&inode->i_lock);
> > +
> > +		/* for each watch, send FS_UNMOUNT and then remove it */
> > +		if (post_unmount && fsnotify_sb_info(sb)) {
> > +			fsnotify_inode(inode, FS_UNMOUNT);
> > +			fsnotify_inode_delete(inode);
> > +		}
> 
> This will not work because you are in the unsafe iterator holding
> sb->s_inode_list_lock. To be able to call into fsnotify, you need to do the
> iget / iput dance and release s_inode_list_lock, which does not work when a
> filesystem has its own inodes iterator AFAICT... That's why I've called it
> a layering violation.

Ah, yes. So we'll need to special case it some way either way. Still feels
saner to do it in one iteration and make the inode eviction not use the
unsafe version, but maybe that's indeed better postponed until after Dave's
series.
On Thu, Oct 03, 2024 at 02:56:50PM +0200, Jan Kara wrote:
> On Thu 03-10-24 05:39:23, Christoph Hellwig wrote:
> > @@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
> >   */
> >  static int evict_inode_fn(struct inode *inode, void *data)
> >  {
> > +	struct super_block *sb = inode->i_sb;
> >  	struct list_head *dispose = data;
> > +	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
> >  
> >  	spin_lock(&inode->i_lock);
> > -	if (atomic_read(&inode->i_count) ||
> > -	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
> > +	if (atomic_read(&inode->i_count)) {
> > +		spin_unlock(&inode->i_lock);
> > +
> > +		/* for each watch, send FS_UNMOUNT and then remove it */
> > +		if (post_unmount && fsnotify_sb_info(sb)) {
> > +			fsnotify_inode(inode, FS_UNMOUNT);
> > +			fsnotify_inode_delete(inode);
> > +		}
> 
> This will not work because you are in the unsafe iterator holding
> sb->s_inode_list_lock. To be able to call into fsnotify, you need to do the
> iget / iput dance and release s_inode_list_lock, which does not work when a
> filesystem has its own inodes iterator AFAICT... That's why I've called it
> a layering violation.

The whole point of the iget/iput dance is to stabilise the s_inodes list
iteration whilst it is unlocked - the actual fsnotify calls don't need an
inode reference to work correctly.

IOWs, we don't need to run the fsnotify stuff right here - we can defer
that like we do with the dispose list for all the inodes we mark as
I_FREEING here.

So if we pass a structure:

struct evict_inode_args {
	struct list_head	dispose;
	struct list_head	fsnotify;
};

If we use __iget() instead of requiring an inode state flag to keep the
inode off the LRU for the fsnotify cleanup, then the code fragment above
becomes:

	if (atomic_read(&inode->i_count)) {
		if (post_unmount && fsnotify_sb_info(sb)) {
			__iget(inode);
			inode_lru_list_del(inode);
			spin_unlock(&inode->i_lock);
			list_add(&inode->i_lru, &args->fsnotify);
		}
		return INO_ITER_DONE;
	}

And then once we return to evict_inodes(), we do this:

	while (!list_empty(&args->fsnotify)) {
		struct inode *inode;

		inode = list_first_entry(&args->fsnotify, struct inode, i_lru);
		list_del_init(&inode->i_lru);

		fsnotify_inode(inode, FS_UNMOUNT);
		fsnotify_inode_delete(inode);
		iput(inode);
		cond_resched();
	}

And so now all the fsnotify cleanup is done outside the traversal in one
large batch from evict_inodes().

As for the landlock code, I think it needs to have its own internal
tracking mechanism and not search the sb inode list for inodes that it
holds references to. LSM cleanup should be run before we get to tearing
down the inode cache, not after....

-Dave.
On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> On Thu, Oct 03, 2024 at 02:56:50PM +0200, Jan Kara wrote:
> > On Thu 03-10-24 05:39:23, Christoph Hellwig wrote:
> > > @@ -789,11 +789,23 @@ static bool dispose_list(struct list_head *head)
> > >   */
> > >  static int evict_inode_fn(struct inode *inode, void *data)
> > >  {
> > > +	struct super_block *sb = inode->i_sb;
> > >  	struct list_head *dispose = data;
> > > +	bool post_unmount = !(sb->s_flags & SB_ACTIVE);
> > >  
> > >  	spin_lock(&inode->i_lock);
> > > -	if (atomic_read(&inode->i_count) ||
> > > -	    (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))) {
> > > +	if (atomic_read(&inode->i_count)) {
> > > +		spin_unlock(&inode->i_lock);
> > > +
> > > +		/* for each watch, send FS_UNMOUNT and then remove it */
> > > +		if (post_unmount && fsnotify_sb_info(sb)) {
> > > +			fsnotify_inode(inode, FS_UNMOUNT);
> > > +			fsnotify_inode_delete(inode);
> > > +		}
> > 
> > This will not work because you are in the unsafe iterator holding
> > sb->s_inode_list_lock. To be able to call into fsnotify, you need to do
> > the iget / iput dance and release s_inode_list_lock, which does not work
> > when a filesystem has its own inodes iterator AFAICT... That's why I've
> > called it a layering violation.
> 
> The whole point of the iget/iput dance is to stabilise the s_inodes list
> iteration whilst it is unlocked - the actual fsnotify calls don't need an
> inode reference to work correctly.
> 
> IOWs, we don't need to run the fsnotify stuff right here - we can defer
> that like we do with the dispose list for all the inodes we mark as
> I_FREEING here.
> 
> So if we pass a structure:
> 
> struct evict_inode_args {
> 	struct list_head	dispose;
> 	struct list_head	fsnotify;
> };
> 
> If we use __iget() instead of requiring an inode state flag to keep the
> inode off the LRU for the fsnotify cleanup, then the code fragment above
> becomes:
> 
> 	if (atomic_read(&inode->i_count)) {
> 		if (post_unmount && fsnotify_sb_info(sb)) {
> 			__iget(inode);
> 			inode_lru_list_del(inode);
> 			spin_unlock(&inode->i_lock);
> 			list_add(&inode->i_lru, &args->fsnotify);
> 		}

Nit: Need to release i_lock in the else branch here. Otherwise interesting
idea. Yes, something like this could work even in the unsafe iterator.

> 		return INO_ITER_DONE;
> 	}

> And then once we return to evict_inodes(), we do this:
> 
> 	while (!list_empty(&args->fsnotify)) {
> 		struct inode *inode;
> 
> 		inode = list_first_entry(&args->fsnotify, struct inode, i_lru);
> 		list_del_init(&inode->i_lru);
> 
> 		fsnotify_inode(inode, FS_UNMOUNT);
> 		fsnotify_inode_delete(inode);
> 		iput(inode);
> 		cond_resched();
> 	}
> 
> And so now all the fsnotify cleanup is done outside the traversal in one
> large batch from evict_inodes().

Yup.

> As for the landlock code, I think it needs to have its own internal
> tracking mechanism and not search the sb inode list for inodes that it
> holds references to. LSM cleanup should be run before we get to tearing
> down the inode cache, not after....

Well, I think the LSM cleanup could in principle be handled together with
the fsnotify cleanup but I didn't check the details.

								Honza
On Thu, Oct 03, 2024 at 06:17:31PM +0200, Jan Kara wrote:
> On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> > As for the landlock code, I think it needs to have its own internal
> > tracking mechanism and not search the sb inode list for inodes that it
> > holds references to. LSM cleanup should be run before we get to tearing
> > down the inode cache, not after....
> 
> Well, I think the LSM cleanup could in principle be handled together with
> the fsnotify cleanup but I didn't check the details.

I'm not sure how we tell if an inode potentially has an LSM related
reference hanging off it. The landlock code looks to make the assumption
that the only referenced inodes it sees will have a valid
inode->i_security pointer if landlock is enabled. i.e. it calls
landlock_inode(inode) and dereferences the returned value without ever
checking if inode->i_security is NULL or not.

I mean, we could do a check for inode->i_security when the refcount is
elevated and replace the security_sb_delete hook with a
security_evict_inode hook similar to the proposed fsnotify eviction from
evict_inodes().

But screwing with the LSM infrastructure looks .... obnoxiously complex
from the outside...

-Dave.
On Fri, Oct 04, 2024 at 10:46:27AM GMT, Dave Chinner wrote:
> On Thu, Oct 03, 2024 at 06:17:31PM +0200, Jan Kara wrote:
> > On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> > > As for the landlock code, I think it needs to have its own internal
> > > tracking mechanism and not search the sb inode list for inodes that
> > > it holds references to. LSM cleanup should be run before we get to
> > > tearing down the inode cache, not after....
> > 
> > Well, I think the LSM cleanup could in principle be handled together
> > with the fsnotify cleanup but I didn't check the details.
> 
> I'm not sure how we tell if an inode potentially has an LSM related
> reference hanging off it. The landlock code looks to make the assumption
> that the only referenced inodes it sees will have a valid
> inode->i_security pointer if landlock is enabled. i.e. it calls
> landlock_inode(inode) and dereferences the returned value without ever
> checking if inode->i_security is NULL or not.
> 
> I mean, we could do a check for inode->i_security when the refcount is
> elevated and replace the security_sb_delete hook with a
> security_evict_inode hook similar to the proposed fsnotify eviction
> from evict_inodes().
> 
> But screwing with the LSM infrastructure looks .... obnoxiously complex
> from the outside...

Imho, please just focus on the immediate feedback and ignore all the extra
bells and whistles that we could or should do. I prefer all of that to be
done after this series lands.
On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> > But screwing with the LSM infrastructure looks .... obnoxiously complex
> > from the outside...
> 
> Imho, please just focus on the immediate feedback and ignore all the
> extra bells and whistles that we could or should do. I prefer all of
> that to be done after this series lands.

For the LSM mess: absolutely. For fsnotify it seems like Dave has a good
idea to integrate it, and it removes the somewhat awkward need for the
reffed flag. So if that delayed notify idea works out I'd prefer to see
that in over the flag.
On Fri 04-10-24 05:14:36, Christoph Hellwig wrote:
> On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> > > But screwing with the LSM infrastructure looks .... obnoxiously complex
> > > from the outside...
> > 
> > Imho, please just focus on the immediate feedback and ignore all the
> > extra bells and whistles that we could or should do. I prefer all of
> > that to be done after this series lands.
> 
> For the LSM mess: absolutely. For fsnotify it seems like Dave has a good
> idea to integrate it, and it removes the somewhat awkward need for the
> reffed flag. So if that delayed notify idea works out I'd prefer to see
> that in over the flag.

As I wrote in one of the emails in this (now huge) thread, I'm fine with
completely dropping that inode->i_count check from
fsnotify_unmount_inodes(). It made sense when it was called before
evict_inodes(), but after 1edc8eb2e931 ("fs: call fsnotify_sb_delete after
evict_inodes") the usefulness of this check is rather doubtful. So we can
drop the awkward flag regardless of whether we unify evict_inodes() with
fsnotify_unmount_inodes() or not. I have no strong preference whether the
unification happens as part of this patch set or later on, so it's up to
Dave as far as I'm concerned.

								Honza
On Fri, Oct 4, 2024 at 9:49 AM Jan Kara <jack@suse.cz> wrote:
> On Fri 04-10-24 05:14:36, Christoph Hellwig wrote:
> > On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> > > > But screwing with the LSM infrastructure looks .... obnoxiously
> > > > complex from the outside...
> > >
> > > Imho, please just focus on the immediate feedback and ignore all the
> > > extra bells and whistles that we could or should do. I prefer all of
> > > that to be done after this series lands.
> >
> > For the LSM mess: absolutely. For fsnotify it seems like Dave has a good
> > idea to integrate it, and it removes the somewhat awkward need for the
> > reffed flag. So if that delayed notify idea works out I'd prefer to see
> > that in over the flag.
>
> As I wrote in one of the emails in this (now huge) thread, I'm fine with
> completely dropping that inode->i_count check from
> fsnotify_unmount_inodes(). It made sense when it was called before
> evict_inodes(), but after 1edc8eb2e931 ("fs: call fsnotify_sb_delete after
> evict_inodes") the usefulness of this check is rather doubtful. So we can
> drop the awkward flag regardless of whether we unify evict_inodes() with
> fsnotify_unmount_inodes() or not. I have no strong preference whether the
> unification happens as part of this patch set or later on, so it's up to
> Dave as far as I'm concerned.

I didn't get a chance to look at this thread until just now and I'm
noticing that the email used for Mickaël is likely not the best, so I'm
adding the email he uses in MAINTAINERS as well as that of Günther Noack,
a designated Landlock reviewer.

Mickaël, Günther, the lore link for the full discussion is below:

https://lore.kernel.org/all/Zv5GfY1WS_aaczZM@infradead.org
On Fri, Oct 04, 2024 at 09:21:19AM +0200, Christian Brauner wrote:
> On Fri, Oct 04, 2024 at 10:46:27AM GMT, Dave Chinner wrote:
> > On Thu, Oct 03, 2024 at 06:17:31PM +0200, Jan Kara wrote:
> > > On Thu 03-10-24 23:59:51, Dave Chinner wrote:
> > > > As for the landlock code, I think it needs to have its own internal
> > > > tracking mechanism and not search the sb inode list for inodes that
> > > > it holds references to. LSM cleanup should be run before we get to
> > > > tearing down the inode cache, not after....
> > > 
> > > Well, I think the LSM cleanup could in principle be handled together
> > > with the fsnotify cleanup but I didn't check the details.
> > 
> > I'm not sure how we tell if an inode potentially has an LSM related
> > reference hanging off it. The landlock code looks to make the assumption
> > that the only referenced inodes it sees will have a valid
> > inode->i_security pointer if landlock is enabled. i.e. it calls
> > landlock_inode(inode) and dereferences the returned value without ever
> > checking if inode->i_security is NULL or not.
> > 
> > I mean, we could do a check for inode->i_security when the refcount is
> > elevated and replace the security_sb_delete hook with a
> > security_evict_inode hook similar to the proposed fsnotify eviction
> > from evict_inodes().
> > 
> > But screwing with the LSM infrastructure looks .... obnoxiously complex
> > from the outside...
> 
> Imho, please just focus on the immediate feedback and ignore all the
> extra bells and whistles that we could or should do. I prefer all of
> that to be done after this series lands.

Actually, it's not as bad as I thought it was going to be. I've already
moved both fsnotify and LSM inode eviction to evict_inodes() as
preparatory patches...

Dave Chinner (2):
      vfs: move fsnotify inode eviction to evict_inodes()
      vfs, lsm: rework lsm inode eviction at unmount

 fs/inode.c                    |  52 +++++++++++++---
 fs/notify/fsnotify.c          |  60 -------------------
 fs/super.c                    |   8 +--
 include/linux/lsm_hook_defs.h |   2 +-
 include/linux/security.h      |   2 +-
 security/landlock/fs.c        | 134 ++++++++++--------------------------
 security/security.c           |  31 ++++++----
 7 files changed, 99 insertions(+), 190 deletions(-)

-Dave.
diff --git a/block/bdev.c b/block/bdev.c index b5a362156ca1..5f720e12f731 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -1226,56 +1226,36 @@ void bdev_mark_dead(struct block_device *bdev, bool surprise) */ EXPORT_SYMBOL_GPL(bdev_mark_dead); +static int sync_bdev_fn(struct inode *inode, void *data) +{ + struct block_device *bdev; + bool wait = *(bool *)data; + + if (inode->i_mapping->nrpages == 0) + return INO_ITER_DONE; + + bdev = I_BDEV(inode); + mutex_lock(&bdev->bd_disk->open_mutex); + if (!atomic_read(&bdev->bd_openers)) { + ; /* skip */ + } else if (wait) { + /* + * We keep the error status of individual mapping so + * that applications can catch the writeback error using + * fsync(2). See filemap_fdatawait_keep_errors() for + * details. + */ + filemap_fdatawait_keep_errors(inode->i_mapping); + } else { + filemap_fdatawrite(inode->i_mapping); + } + mutex_unlock(&bdev->bd_disk->open_mutex); + return INO_ITER_DONE; +} + void sync_bdevs(bool wait) { - struct inode *inode, *old_inode = NULL; - - spin_lock(&blockdev_superblock->s_inode_list_lock); - list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) { - struct address_space *mapping = inode->i_mapping; - struct block_device *bdev; - - spin_lock(&inode->i_lock); - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) || - mapping->nrpages == 0) { - spin_unlock(&inode->i_lock); - continue; - } - __iget(inode); - spin_unlock(&inode->i_lock); - spin_unlock(&blockdev_superblock->s_inode_list_lock); - /* - * We hold a reference to 'inode' so it couldn't have been - * removed from s_inodes list while we dropped the - * s_inode_list_lock We cannot iput the inode now as we can - * be holding the last reference and we cannot iput it under - * s_inode_list_lock. So we keep the reference and iput it - * later. 
- */ - iput(old_inode); - old_inode = inode; - bdev = I_BDEV(inode); - - mutex_lock(&bdev->bd_disk->open_mutex); - if (!atomic_read(&bdev->bd_openers)) { - ; /* skip */ - } else if (wait) { - /* - * We keep the error status of individual mapping so - * that applications can catch the writeback error using - * fsync(2). See filemap_fdatawait_keep_errors() for - * details. - */ - filemap_fdatawait_keep_errors(inode->i_mapping); - } else { - filemap_fdatawrite(inode->i_mapping); - } - mutex_unlock(&bdev->bd_disk->open_mutex); - - spin_lock(&blockdev_superblock->s_inode_list_lock); - } - spin_unlock(&blockdev_superblock->s_inode_list_lock); - iput(old_inode); + super_iter_inodes(blockdev_superblock, sync_bdev_fn, &wait, 0); } /* diff --git a/fs/drop_caches.c b/fs/drop_caches.c index d45ef541d848..901cda15537f 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -16,36 +16,20 @@ /* A global variable is a bit ugly, but it keeps the code simple */ int sysctl_drop_caches; -static void drop_pagecache_sb(struct super_block *sb, void *unused) +static int invalidate_inode_fn(struct inode *inode, void *data) { - struct inode *inode, *toput_inode = NULL; - - spin_lock(&sb->s_inode_list_lock); - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { - spin_lock(&inode->i_lock); - /* - * We must skip inodes in unusual state. We may also skip - * inodes without pages but we deliberately won't in case - * we need to reschedule to avoid softlockups. 
- */ - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || - (mapping_empty(inode->i_mapping) && !need_resched())) { - spin_unlock(&inode->i_lock); - continue; - } - __iget(inode); - spin_unlock(&inode->i_lock); - spin_unlock(&sb->s_inode_list_lock); - + if (!mapping_empty(inode->i_mapping)) invalidate_mapping_pages(inode->i_mapping, 0, -1); - iput(toput_inode); - toput_inode = inode; + return INO_ITER_DONE; +} - cond_resched(); - spin_lock(&sb->s_inode_list_lock); - } - spin_unlock(&sb->s_inode_list_lock); - iput(toput_inode); +/* + * Note: it would be nice to check mapping_empty() before we get a reference on + * the inode in super_iter_inodes(), but that's a future optimisation. + */ +static void drop_pagecache_sb(struct super_block *sb, void *unused) +{ + super_iter_inodes(sb, invalidate_inode_fn, NULL, 0); } int drop_caches_sysctl_handler(const struct ctl_table *table, int write, diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c index e83d293c3614..f20862614ad6 100644 --- a/fs/gfs2/ops_fstype.c +++ b/fs/gfs2/ops_fstype.c @@ -1714,53 +1714,10 @@ static int gfs2_meta_init_fs_context(struct fs_context *fc) return 0; } -/** - * gfs2_evict_inodes - evict inodes cooperatively - * @sb: the superblock - * - * When evicting an inode with a zero link count, we are trying to upgrade the - * inode's iopen glock from SH to EX mode in order to determine if we can - * delete the inode. The other nodes are supposed to evict the inode from - * their caches if they can, and to poke the inode's inode glock if they cannot - * do so. Either behavior allows gfs2_upgrade_iopen_glock() to proceed - * quickly, but if the other nodes are not cooperating, the lock upgrading - * attempt will time out. Since inodes are evicted sequentially, this can add - * up quickly. - * - * Function evict_inodes() tries to keep the s_inode_list_lock list locked over - * a long time, which prevents other inodes from being evicted concurrently. 
- * This precludes the cooperative behavior we are looking for. This special
- * version of evict_inodes() avoids that.
- *
- * Modeled after drop_pagecache_sb().
- */
-static void gfs2_evict_inodes(struct super_block *sb)
+/* Nothing to do because we just want to bounce the inode through iput() */
+static int gfs2_evict_inode_fn(struct inode *inode, void *data)
 {
-	struct inode *inode, *toput_inode = NULL;
-	struct gfs2_sbd *sdp = sb->s_fs_info;
-
-	set_bit(SDF_EVICTING, &sdp->sd_flags);
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) &&
-		    !need_resched()) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		atomic_inc(&inode->i_count);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-		iput(toput_inode);
-		toput_inode = inode;
-
-		cond_resched();
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(toput_inode);
+	return INO_ITER_DONE;
 }
 
 static void gfs2_kill_sb(struct super_block *sb)
@@ -1779,7 +1736,23 @@ static void gfs2_kill_sb(struct super_block *sb)
 	sdp->sd_master_dir = NULL;
 	shrink_dcache_sb(sb);
-	gfs2_evict_inodes(sb);
+	/*
+	 * When evicting an inode with a zero link count, we are trying to
+	 * upgrade the inode's iopen glock from SH to EX mode in order to
+	 * determine if we can delete the inode. The other nodes are supposed
+	 * to evict the inode from their caches if they can, and to poke the
+	 * inode's inode glock if they cannot do so. Either behavior allows
+	 * gfs2_upgrade_iopen_glock() to proceed quickly, but if the other nodes
+	 * are not cooperating, the lock upgrading attempt will time out. Since
+	 * inodes are evicted sequentially, this can add up quickly.
+	 *
+	 * evict_inodes() tries to keep the s_inode_list_lock list locked over a
+	 * long time, which prevents other inodes from being evicted
+	 * concurrently. This precludes the cooperative behavior we are looking
+	 * for.
+	 */
+	set_bit(SDF_EVICTING, &sdp->sd_flags);
+	super_iter_inodes(sb, gfs2_evict_inode_fn, NULL, 0);
 
 	/*
 	 * Flush and then drain the delete workqueue here (via
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 272c8a1dab3c..68c34ed94271 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -28,63 +28,14 @@ void __fsnotify_vfsmount_delete(struct vfsmount *mnt)
 	fsnotify_clear_marks_by_mount(mnt);
 }
 
-/**
- * fsnotify_unmount_inodes - an sb is unmounting. handle any watched inodes.
- * @sb: superblock being unmounted.
- *
- * Called during unmount with no locks held, so needs to be safe against
- * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
- */
-static void fsnotify_unmount_inodes(struct super_block *sb)
+static int fsnotify_unmount_inode_fn(struct inode *inode, void *data)
 {
-	struct inode *inode, *iput_inode = NULL;
+	spin_unlock(&inode->i_lock);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		/*
-		 * We cannot __iget() an inode in state I_FREEING,
-		 * I_WILL_FREE, or I_NEW which is fine because by that point
-		 * the inode cannot have any associated watches.
-		 */
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		/*
-		 * If i_count is zero, the inode cannot have any watches and
-		 * doing an __iget/iput with SB_ACTIVE clear would actually
-		 * evict all inodes with zero i_count from icache which is
-		 * unnecessarily violent and may in fact be illegal to do.
-		 * However, we should have been called /after/ evict_inodes
-		 * removed all zero refcount inodes, in any case. Test to
-		 * be sure.
-		 */
-		if (!atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-		iput(iput_inode);
-
-		/* for each watch, send FS_UNMOUNT and then remove it */
-		fsnotify_inode(inode, FS_UNMOUNT);
-
-		fsnotify_inode_delete(inode);
-
-		iput_inode = inode;
-
-		cond_resched();
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-
-	iput(iput_inode);
+	/* for each watch, send FS_UNMOUNT and then remove it */
+	fsnotify_inode(inode, FS_UNMOUNT);
+	fsnotify_inode_delete(inode);
+	return INO_ITER_DONE;
 }
 
 void fsnotify_sb_delete(struct super_block *sb)
@@ -95,7 +46,19 @@ void fsnotify_sb_delete(struct super_block *sb)
 	if (!sbinfo)
 		return;
 
-	fsnotify_unmount_inodes(sb);
+	/*
+	 * If i_count is zero, the inode cannot have any watches and
+	 * doing an __iget/iput with SB_ACTIVE clear would actually
+	 * evict all inodes with zero i_count from icache which is
+	 * unnecessarily violent and may in fact be illegal to do.
+	 * However, we should have been called /after/ evict_inodes
+	 * removed all zero refcount inodes, in any case. Hence we use
+	 * INO_ITER_REFERENCED to ensure zero refcount inodes are filtered
+	 * properly.
+	 */
+	super_iter_inodes(sb, fsnotify_unmount_inode_fn, NULL,
+			INO_ITER_REFERENCED);
+
 	fsnotify_clear_marks_by_sb(sb);
 
 	/* Wait for outstanding object references from connectors */
 	wait_var_event(fsnotify_sb_watched_objects(sb),
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index ea0bd807fed7..ea9fce7acd1b 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1017,56 +1017,40 @@ static int dqinit_needed(struct inode *inode, int type)
 	return 0;
 }
 
+struct dquot_ref_data {
+	int type;
+	int reserved;
+};
+
+static int add_dquot_ref_fn(struct inode *inode, void *data)
+{
+	struct dquot_ref_data *ref = data;
+	int ret;
+
+	if (!dqinit_needed(inode, ref->type))
+		return INO_ITER_DONE;
+
+#ifdef CONFIG_QUOTA_DEBUG
+	if (unlikely(inode_get_rsv_space(inode) > 0))
+		ref->reserved++;
+#endif
+	ret = __dquot_initialize(inode, ref->type);
+	if (ret < 0)
+		return ret;
+	return INO_ITER_DONE;
+}
+
 /* This routine is guarded by s_umount semaphore */
 static int add_dquot_ref(struct super_block *sb, int type)
 {
-	struct inode *inode, *old_inode = NULL;
-#ifdef CONFIG_QUOTA_DEBUG
-	int reserved = 0;
-#endif
-	int err = 0;
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    !atomic_read(&inode->i_writecount) ||
-		    !dqinit_needed(inode, type)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-#ifdef CONFIG_QUOTA_DEBUG
-		if (unlikely(inode_get_rsv_space(inode) > 0))
-			reserved = 1;
-#endif
-		iput(old_inode);
-		err = __dquot_initialize(inode, type);
-		if (err) {
-			iput(inode);
-			goto out;
-		}
+	struct dquot_ref_data ref = {
+		.type = type,
+	};
+	int err;
 
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock. We cannot iput the inode now as we can be
-		 * holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		old_inode = inode;
-		cond_resched();
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
-out:
+	err = super_iter_inodes(sb, add_dquot_ref_fn, &ref, 0);
 #ifdef CONFIG_QUOTA_DEBUG
-	if (reserved) {
+	if (ref.reserved) {
 		quota_error(sb, "Writes happened before quota was turned on "
 			"thus quota information is probably inconsistent. "
 			"Please run quotacheck(8)");
@@ -1075,11 +1059,6 @@ static int add_dquot_ref(struct super_block *sb, int type)
 	return err;
 }
 
-struct dquot_ref_data {
-	int type;
-	int reserved;
-};
-
 static int remove_dquot_ref_fn(struct inode *inode, void *data)
 {
 	struct dquot_ref_data *ref = data;
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 7d79fc8abe21..013ec4017ddd 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -1223,109 +1223,60 @@ static void hook_inode_free_security_rcu(void *inode_security)
 
 /*
  * Release the inodes used in a security policy.
- *
- * Cf. fsnotify_unmount_inodes() and invalidate_inodes()
  */
+static int release_inode_fn(struct inode *inode, void *data)
+{
+	struct landlock_object *object;
+
+	rcu_read_lock();
+	object = rcu_dereference(landlock_inode(inode)->object);
+	if (!object) {
+		rcu_read_unlock();
+		return INO_ITER_DONE;
+	}
+
+	/*
+	 * If there is no concurrent release_inode() ongoing, then we
+	 * are in charge of calling iput() on this inode, otherwise we
+	 * will just wait for it to finish.
+	 */
+	spin_lock(&object->lock);
+	if (object->underobj != inode) {
+		spin_unlock(&object->lock);
+		rcu_read_unlock();
+		return INO_ITER_DONE;
+	}
+
+	object->underobj = NULL;
+	spin_unlock(&object->lock);
+	rcu_read_unlock();
+
+	/*
+	 * Because object->underobj was not NULL, release_inode() and
+	 * get_inode_object() guarantee that it is safe to reset
+	 * landlock_inode(inode)->object while it is not NULL. It is therefore
+	 * not necessary to lock inode->i_lock.
+	 */
+	rcu_assign_pointer(landlock_inode(inode)->object, NULL);
+
+	/*
+	 * At this point, we own the ihold() reference that was originally set
+	 * up by get_inode_object() as well as the reference the inode iterator
+	 * obtained before calling us. Therefore the following call to iput()
+	 * will not sleep nor drop the inode because there are now at least two
+	 * references to it.
+	 */
+	iput(inode);
+	return INO_ITER_DONE;
+}
+
 static void hook_sb_delete(struct super_block *const sb)
 {
-	struct inode *inode, *prev_inode = NULL;
-
 	if (!landlock_initialized)
 		return;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		struct landlock_object *object;
+	super_iter_inodes(sb, release_inode_fn, NULL, 0);
 
-		/* Only handles referenced inodes. */
-		if (!atomic_read(&inode->i_count))
-			continue;
-
-		/*
-		 * Protects against concurrent modification of inode (e.g.
-		 * from get_inode_object()).
-		 */
-		spin_lock(&inode->i_lock);
-		/*
-		 * Checks I_FREEING and I_WILL_FREE to protect against a race
-		 * condition when release_inode() just called iput(), which
-		 * could lead to a NULL dereference of inode->security or a
-		 * second call to iput() for the same Landlock object. Also
-		 * checks I_NEW because such inode cannot be tied to an object.
-		 */
-		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		rcu_read_lock();
-		object = rcu_dereference(landlock_inode(inode)->object);
-		if (!object) {
-			rcu_read_unlock();
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		/* Keeps a reference to this inode until the next loop walk. */
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-
-		/*
-		 * If there is no concurrent release_inode() ongoing, then we
-		 * are in charge of calling iput() on this inode, otherwise we
-		 * will just wait for it to finish.
-		 */
-		spin_lock(&object->lock);
-		if (object->underobj == inode) {
-			object->underobj = NULL;
-			spin_unlock(&object->lock);
-			rcu_read_unlock();
-
-			/*
-			 * Because object->underobj was not NULL,
-			 * release_inode() and get_inode_object() guarantee
-			 * that it is safe to reset
-			 * landlock_inode(inode)->object while it is not NULL.
-			 * It is therefore not necessary to lock inode->i_lock.
-			 */
-			rcu_assign_pointer(landlock_inode(inode)->object, NULL);
-			/*
-			 * At this point, we own the ihold() reference that was
-			 * originally set up by get_inode_object() and the
-			 * __iget() reference that we just set in this loop
-			 * walk. Therefore the following call to iput() will
-			 * not sleep nor drop the inode because there is now at
-			 * least two references to it.
-			 */
-			iput(inode);
-		} else {
-			spin_unlock(&object->lock);
-			rcu_read_unlock();
-		}
-
-		if (prev_inode) {
-			/*
-			 * At this point, we still own the __iget() reference
-			 * that we just set in this loop walk. Therefore we
-			 * can drop the list lock and know that the inode won't
-			 * disappear from under us until the next loop walk.
-			 */
-			spin_unlock(&sb->s_inode_list_lock);
-			/*
-			 * We can now actually put the inode reference from the
-			 * previous loop walk, which is not needed anymore.
-			 */
-			iput(prev_inode);
-			cond_resched();
-			spin_lock(&sb->s_inode_list_lock);
-		}
-		prev_inode = inode;
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-
-	/* Puts the inode reference from the last loop walk, if any. */
-	if (prev_inode)
-		iput(prev_inode);
-	/* Waits for pending iput() in release_inode(). */
+	/* Waits for pending iput()s in release_inode(). */
 	wait_var_event(&landlock_superblock(sb)->inode_refs,
 		       !atomic_long_read(&landlock_superblock(sb)->inode_refs));
 }