Message ID | 171270972068.3634974.15204601732623547015.stgit@frogsfrogsfrogs
---|---
State | Superseded
Series | [1/3] xfs: reduce the rate of cond_resched calls inside scrub
> +	/*
> +	 * If the caller wants us to do a scrub-by-handle and the file used to
> +	 * call the ioctl is not the same file, load the incore inode and pin
> +	 * it across all the scrubv actions to avoid repeated UNTRUSTED
> +	 * lookups.  The reference is not passed to deeper layers of scrub
> +	 * because each scrubber gets to decide its own strategy for getting an
> +	 * inode.
> +	 */
> +	if (vhead->svh_ino && vhead->svh_ino != ip_in->i_ino)
> +		handle_ip = xchk_scrubv_open_by_handle(mp, vhead);

Oh.  So we read the inode, keep a reference to it, but still hit the
inode cache every time.  A little non-obvious and not perfect for
performance, but based on your numbers probably good enough.

Curious: what is the reason the scrubbers want/need different ways to
get at the inode?

> +	/*
> +	 * If we're holding the only reference to an inode opened via handle
> +	 * and the scan was clean, mark it dontcache so that we don't pollute
> +	 * the cache.
> +	 */
> +	if (handle_ip) {
> +		if (set_dontcache &&
> +		    atomic_read(&VFS_I(handle_ip)->i_count) == 1)
> +			d_mark_dontcache(VFS_I(handle_ip));
> +		xfs_irele(handle_ip);
> +	}

This looks a little weird to me.  Can't we simply use XFS_IGET_DONTCACHE
at iget time and then clear I_DONTCACHE here if we want to keep the
inode around?  Given that we only set the uncached flag from
XFS_IGET_DONTCACHE on a cache miss, we won't have set DCACHE_DONTCACHE
anywhere (and don't really care about the dentries to start with).

But why do we care about keeping the inodes with errors in memory here,
but not elsewhere?  Maybe this can be explained in an expanded comment.
On Wed, Apr 10, 2024 at 08:12:16AM -0700, Christoph Hellwig wrote:
> > +	/*
> > +	 * If the caller wants us to do a scrub-by-handle and the file used to
> > +	 * call the ioctl is not the same file, load the incore inode and pin
> > +	 * it across all the scrubv actions to avoid repeated UNTRUSTED
> > +	 * lookups.  The reference is not passed to deeper layers of scrub
> > +	 * because each scrubber gets to decide its own strategy for getting an
> > +	 * inode.
> > +	 */
> > +	if (vhead->svh_ino && vhead->svh_ino != ip_in->i_ino)
> > +		handle_ip = xchk_scrubv_open_by_handle(mp, vhead);
>
> Oh.  So we read the inode, keep a reference to it, but still hit the
> inode cache every time.  A little non-obvious and not perfect for
> performance, but based on your numbers probably good enough.
>
> Curious: what is the reason the scrubbers want/need different ways to
> get at the inode?

I don't remember the exact reason why we don't pass this ip into
xfs_scrub_metadata, but iirc the inode scrub setup functions react
differently (from the bmap/dir/attr/symlink scrubbers) when iget
failures occur.  Also this way xfs_scrub_metadata owns the refcount to
whatever inode it picks up, and can do whatever it wants with that
reference.

> > +	/*
> > +	 * If we're holding the only reference to an inode opened via handle
> > +	 * and the scan was clean, mark it dontcache so that we don't pollute
> > +	 * the cache.
> > +	 */
> > +	if (handle_ip) {
> > +		if (set_dontcache &&
> > +		    atomic_read(&VFS_I(handle_ip)->i_count) == 1)
> > +			d_mark_dontcache(VFS_I(handle_ip));
> > +		xfs_irele(handle_ip);
> > +	}
>
> This looks a little weird to me.  Can't we simply use
> XFS_IGET_DONTCACHE at iget time and then clear I_DONTCACHE here if we
> want to keep the inode around?

Not anymore, because other threads can mess around with the dontcache
state (yay fsdax access path changes!!) while we are scrubbing the
inode.
> Given that we only set the uncached flag from XFS_IGET_DONTCACHE on a
> cache miss, we won't have set DCACHE_DONTCACHE anywhere (and don't
> really care about the dentries to start with).
>
> But why do we care about keeping the inodes with errors in memory
> here, but not elsewhere?

We actually do, but it's not obvious...

> Maybe this can be explained in an expanded comment.

...because this bit here is basically the same as xchk_irele, but we
don't have a xfs_scrub object to pass in, so it's opencoded.  I could
pull this logic out into:

void
xfs_scrub_irele(struct xfs_inode *ip)
{
	if (atomic_read(&VFS_I(ip)->i_count) == 1) {
		/*
		 * If this is the last reference to the inode and the
		 * caller permits it, set DONTCACHE to avoid thrashing.
		 */
		d_mark_dontcache(VFS_I(ip));
	}
	xfs_irele(ip);
}

and change xchk_irele to:

void
xchk_irele(
	struct xfs_scrub	*sc,
	struct xfs_inode	*ip)
{
	if (sc->tp) {
		spin_lock(&VFS_I(ip)->i_lock);
		VFS_I(ip)->i_state &= ~I_DONTCACHE;
		spin_unlock(&VFS_I(ip)->i_lock);
		xfs_irele(ip);
		return;
	}

	xfs_scrub_irele(ip);
}
On Wed, Apr 10, 2024 at 06:15:02PM -0700, Darrick J. Wong wrote:
> > This looks a little weird to me.  Can't we simply use
> > XFS_IGET_DONTCACHE at iget time and then clear I_DONTCACHE here if
> > we want to keep the inode around?
>
> Not anymore, because other threads can mess around with the dontcache
> state (yay fsdax access path changes!!) while we are scrubbing the
> inode.

You mean xfs_ioctl_setattr_prepare_dax?  Oh lovely, a completely
undocumented d_mark_dontcache in a completely non-obvious place.

It seems to have appeared in

commit e4f9ba20d3b8c2b86ec71f326882e1a3c4e47953
Author: Ira Weiny <ira.weiny@intel.com>
Date:   Thu Apr 30 07:41:38 2020 -0700

    fs/xfs: Update xfs_ioctl_setattr_dax_invalidate()

without any explanation either.  And I can't see any reason why we'd
prevent inodes and dentries from being cached after DAX mode switches
to start with.  I can only guess; maybe the commit thinks
d_mark_dontcache is about data caching?

> > Given that we only set the uncached flag from XFS_IGET_DONTCACHE on
> > a cache miss, we won't have set DCACHE_DONTCACHE anywhere (and
> > don't really care about the dentries to start with).
> >
> > But why do we care about keeping the inodes with errors in memory
> > here, but not elsewhere?
>
> We actually do, but it's not obvious...
>
> > Maybe this can be explained in an expanded comment.
>
> ...because this bit here is basically the same as xchk_irele, but we
> don't have a xfs_scrub object to pass in, so it's opencoded.  I could
> pull this logic out into:

Eww, I hadn't seen xchk_irele before.  To me it looks like
I_DONTCACHE/d_mark_dontcache is really the wrong vehicle here.

I'd instead have a XFS_IGET_SCRUB, which will set an XFS_ISCRUB or
whatever flag on a cache miss.  Any cache hit without XFS_IGET_SCRUB
will clear it.

->drop_inode then always returns true for XFS_ISCRUB inodes unless in
a transaction.

Talking about the in transaction part - why do we drop inodes in the
transaction in scrub, but not elsewhere?
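[Editor's note: the flag-based policy proposed above can be sketched as a
toy model.  XFS_IGET_SCRUB and XFS_ISCRUB are hypothetical names from
this thread only -- neither exists in the kernel -- and the struct and
functions below are a standalone illustration of the proposed policy,
not real XFS code.]

```c
#include <stdbool.h>

/* Hypothetical flags from this proposal; values are arbitrary here. */
#define XFS_IGET_SCRUB	(1u << 0)	/* iget flag: lookup is for scrub */
#define XFS_ISCRUB	(1u << 1)	/* incore flag: only scrub uses it */

struct model_inode {
	unsigned int	flags;		/* stands in for xfs_inode flags */
	bool		in_trans;	/* holder is inside a transaction */
};

/* Cache miss with XFS_IGET_SCRUB tags the inode; any non-scrub hit untags it. */
static void model_iget(struct model_inode *ip, unsigned int iget_flags,
		       bool cache_miss)
{
	if (cache_miss) {
		if (iget_flags & XFS_IGET_SCRUB)
			ip->flags |= XFS_ISCRUB;
	} else if (!(iget_flags & XFS_IGET_SCRUB)) {
		/* A regular lookup proves the inode is in real use. */
		ip->flags &= ~XFS_ISCRUB;
	}
}

/* ->drop_inode policy: always evict scrub-only inodes unless in a transaction. */
static bool model_drop_inode(const struct model_inode *ip)
{
	return (ip->flags & XFS_ISCRUB) && !ip->in_trans;
}
```

The point of the model: the "scrub-only" state lives on the inode itself,
is cleared by any ordinary lookup, and eviction is decided at final iput
rather than by racing on i_count at irele time.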
On Wed, Apr 10, 2024 at 08:49:39PM -0700, Christoph Hellwig wrote:
> On Wed, Apr 10, 2024 at 06:15:02PM -0700, Darrick J. Wong wrote:
> > > This looks a little weird to me.  Can't we simply use
> > > XFS_IGET_DONTCACHE at iget time and then clear I_DONTCACHE here
> > > if we want to keep the inode around?
> >
> > Not anymore, because other threads can mess around with the
> > dontcache state (yay fsdax access path changes!!) while we are
> > scrubbing the inode.
>
> You mean xfs_ioctl_setattr_prepare_dax?  Oh lovely, a completely
> undocumented d_mark_dontcache in a completely non-obvious place.
>
> It seems to have appeared in
>
> commit e4f9ba20d3b8c2b86ec71f326882e1a3c4e47953
> Author: Ira Weiny <ira.weiny@intel.com>
> Date:   Thu Apr 30 07:41:38 2020 -0700
>
>     fs/xfs: Update xfs_ioctl_setattr_dax_invalidate()
>
> without any explanation either.  And I can't see any reason why we'd
> prevent inodes and dentries from being cached after DAX mode switches
> to start with.  I can only guess; maybe the commit thinks
> d_mark_dontcache is about data caching?

It's the horrible way that fsdax "supports" switching the address ops
and i_mapping contents at runtime -- set the ondisk iflag, mark the
inode/dentry for immediate expulsion, wait for reclaim to eat the
inode, then reload it and *presto* new incore iflag and state!

(It's gross but I don't know of a better way to drain i_mapping and
change address ops, and at this point I'm hoping I just plain forget
all that pmem stuff. :P)

> > > Given that we only set the uncached flag from XFS_IGET_DONTCACHE
> > > on a cache miss, we won't have set DCACHE_DONTCACHE anywhere (and
> > > don't really care about the dentries to start with).
> > >
> > > But why do we care about keeping the inodes with errors in memory
> > > here, but not elsewhere?
> >
> > We actually do, but it's not obvious...
> >
> > > Maybe this can be explained in an expanded comment.
> >
> > ...because this bit here is basically the same as xchk_irele, but
> > we don't have a xfs_scrub object to pass in, so it's opencoded.  I
> > could pull this logic out into:
>
> Eww, I hadn't seen xchk_irele before.  To me it looks like
> I_DONTCACHE/d_mark_dontcache is really the wrong vehicle here.
>
> I'd instead have a XFS_IGET_SCRUB, which will set an XFS_ISCRUB or
> whatever flag on a cache miss.  Any cache hit without XFS_IGET_SCRUB
> will clear it.
>
> ->drop_inode then always returns true for XFS_ISCRUB inodes unless in
> a transaction.

How does it determine that we're in a transaction?  We just stopped
storing transactions in current->journal_info due to problems with
nested transactions and ext4 assuming that it can blindly deref that.

> Talking about the in transaction part - why do we drop inodes in the
> transaction in scrub, but not elsewhere?

One example is:

Alloc transaction -> lock rmap btree for repairs -> iscan filesystem
to find rmap records -> iget/irele.

--D
> How does it determine that we're in a transaction?  We just stopped
> storing transactions in current->journal_info due to problems with
> nested transactions and ext4 assuming that it can blindly deref that.

Oh, I was looking at an old tree and missed that.

Well, someone needs to own it; it's not just ext4 but could be us.

> > Talking about the in transaction part - why do we drop inodes in
> > the transaction in scrub, but not elsewhere?
>
> One example is:
>
> Alloc transaction -> lock rmap btree for repairs -> iscan filesystem
> to find rmap records -> iget/irele.

So this is just the magic empty transaction?
On Wed, Apr 10, 2024 at 09:52:41PM -0700, Christoph Hellwig wrote:
> > How does it determine that we're in a transaction?  We just stopped
> > storing transactions in current->journal_info due to problems with
> > nested transactions and ext4 assuming that it can blindly deref
> > that.
>
> Oh, I was looking at an old tree and missed that.

It's not in my tree but I did ... oh crap, that already got committed;
I need to rip out that part of xrep_trans_alloc_hook_dummy now.

> Well, someone needs to own it; it's not just ext4 but could be us.

Er... I don't understand this?  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> > > Talking about the in transaction part - why do we drop inodes in
> > > the transaction in scrub, but not elsewhere?
> >
> > One example is:
> >
> > Alloc transaction -> lock rmap btree for repairs -> iscan
> > filesystem to find rmap records -> iget/irele.
>
> So this is just the magic empty transaction?

No, that's the fully featured repair transaction that will eventually
be used to write/commit the new rmap tree.

--D
On Wed, Apr 10, 2024 at 09:56:45PM -0700, Darrick J. Wong wrote:
> > Well, someone needs to own it; it's not just ext4 but could be us.
>
> Er... I don't understand this?  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If we set current->journal and take a page fault we could not just
recurse into ext4 but into any fs, including XFS.  And everyone
blindly dereferences it, as only one fs can own it.

> > > Alloc transaction -> lock rmap btree for repairs -> iscan
> > > filesystem to find rmap records -> iget/irele.
> >
> > So this is just the magic empty transaction?
>
> No, that's the fully featured repair transaction that will eventually
> be used to write/commit the new rmap tree.

That seems a bit dangerous to me.  I guess we rely on the code inside
the transaction context to never race with unmount, as lack of
SB_ACTIVE will make the VFS ignore the dontcache flag.
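[Editor's note: the SB_ACTIVE point above boils down to the VFS's
retention decision at final iput.  This is a deliberately simplified
standalone model of that decision -- the function name is illustrative,
not a kernel API -- showing why dontcache only matters while the
filesystem is still live: once SB_ACTIVE is gone, the inode is evicted
regardless.]

```c
#include <stdbool.h>

/*
 * Simplified model of whether the VFS keeps an unreferenced inode in
 * the cache at final iput: only on a live (SB_ACTIVE) superblock and
 * only when the inode is not marked dontcache.  During unmount neither
 * setting nor clearing dontcache changes the outcome.
 */
static bool model_keep_cached(bool sb_active, bool dontcache)
{
	return sb_active && !dontcache;
}
```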
On Wed, Apr 10, 2024 at 10:02:23PM -0700, Christoph Hellwig wrote:
> On Wed, Apr 10, 2024 at 09:56:45PM -0700, Darrick J. Wong wrote:
> > > Well, someone needs to own it; it's not just ext4 but could be
> > > us.
> >
> > Er... I don't understand this?  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> If we set current->journal and take a page fault we could not just
> recurse into ext4 but into any fs, including XFS.  And everyone
> blindly dereferences it, as only one fs can own it.

Well back before we ripped it out I had said that XFS should just set
current->journal to 1 to prevent memory corruption, but then Jan Kara
noted that ext4 changes its behavior wrt jbd2 if it sees nonzero
current->journal.  That's why Dave dropped it entirely.

> > > > Alloc transaction -> lock rmap btree for repairs -> iscan
> > > > filesystem to find rmap records -> iget/irele.
> > >
> > > So this is just the magic empty transaction?
> >
> > No, that's the fully featured repair transaction that will
> > eventually be used to write/commit the new rmap tree.
>
> That seems a bit dangerous to me.  I guess we rely on the code inside
> the transaction context to never race with unmount, as lack of
> SB_ACTIVE will make the VFS ignore the dontcache flag.

That and we have an open fd to call the ioctl so any unmount will
fail, and we can't enter scrub if unmount already started.

--D
On Wed, Apr 10, 2024 at 10:21:07PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 10, 2024 at 10:02:23PM -0700, Christoph Hellwig wrote:
> > On Wed, Apr 10, 2024 at 09:56:45PM -0700, Darrick J. Wong wrote:
> > > > Well, someone needs to own it; it's not just ext4 but could be
> > > > us.
> > >
> > > Er... I don't understand this?  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > If we set current->journal and take a page fault we could not just
> > recurse into ext4 but into any fs, including XFS.  And everyone
> > blindly dereferences it, as only one fs can own it.
>
> Well back before we ripped it out I had said that XFS should just set
> current->journal to 1 to prevent memory corruption, but then Jan Kara
> noted that ext4 changes its behavior wrt jbd2 if it sees nonzero
> current->journal.  That's why Dave dropped it entirely.

If you are in a fs context you own current->journal_info.  But you
also must make sure not to copy from, and especially to, user memory,
so as not to recurse into another file system.  A per-thread field
can't work any other way.  So what ext4 is doing here is perfectly
fine.  What XFS did was to set current->journal_info and then cause
page faults, which is not ok.  I'm glad we fixed it.

> > That seems a bit dangerous to me.  I guess we rely on the code
> > inside the transaction context to never race with unmount, as lack
> > of SB_ACTIVE will make the VFS ignore the dontcache flag.
>
> That and we have an open fd to call the ioctl so any unmount will
> fail, and we can't enter scrub if unmount already started.

Indeed.

So I'm still confused on why this new code keeps the inode around if
an error happened, but xchk_irele does not.  What is the benefit of
keeping the inode around here?  Why does it not apply to xchk_irele?

I also don't understand how d_mark_dontcache in
xfs_ioctl_setattr_prepare_dax is supposed to work.  It'll make the
inode go away quicker than without, but it can't force the inode out
by itself.

I'm also lost on the interaction of that with the scrub inodes due to
both above.
I'd still expect any scrub iget to set uncached for a cache miss.  If
we then need to keep the inode around in transaction context we just
keep it.  What is the benefit of playing racing games with i_count to
delay setting the dontcache flag until irele?  And why does the DAX
mess matter for that?

Maybe I'm just thick and this is all obvious, but then it needs to be
documented in detailed comments.
On Thu, Apr 11, 2024 at 07:02:07AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 10, 2024 at 10:21:07PM -0700, Darrick J. Wong wrote:
> > On Wed, Apr 10, 2024 at 10:02:23PM -0700, Christoph Hellwig wrote:
> > > On Wed, Apr 10, 2024 at 09:56:45PM -0700, Darrick J. Wong wrote:
> > > > > Well, someone needs to own it; it's not just ext4 but could
> > > > > be us.
> > > >
> > > > Er... I don't understand this?  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > If we set current->journal and take a page fault we could not
> > > just recurse into ext4 but into any fs, including XFS.  And
> > > everyone blindly dereferences it, as only one fs can own it.
> >
> > Well back before we ripped it out I had said that XFS should just
> > set current->journal to 1 to prevent memory corruption, but then
> > Jan Kara noted that ext4 changes its behavior wrt jbd2 if it sees
> > nonzero current->journal.  That's why Dave dropped it entirely.
>
> If you are in a fs context you own current->journal_info.  But you
> also must make sure not to copy from, and especially to, user memory,
> so as not to recurse into another file system.  A per-thread field
> can't work any other way.  So what ext4 is doing here is perfectly
> fine.  What XFS did was to set current->journal_info and then cause
> page faults, which is not ok.  I'm glad we fixed it.
>
> > > That seems a bit dangerous to me.  I guess we rely on the code
> > > inside the transaction context to never race with unmount, as
> > > lack of SB_ACTIVE will make the VFS ignore the dontcache flag.
> >
> > That and we have an open fd to call the ioctl so any unmount will
> > fail, and we can't enter scrub if unmount already started.
>
> Indeed.
>
> So I'm still confused on why this new code keeps the inode around if
> an error happened, but xchk_irele does not.  What is the benefit of
> keeping the inode around here?  Why does it not apply to xchk_irele?

OH!
Crap, I forgot that some years ago (after the creation of the
vectorized scrub patch) I cleaned up that behavior -- previously scrub
actually did play games with clearing dontcache if the inode was sick.
Then Dave pointed out that we could just change reclaim not to purge
the incore inode (and hence preserve the health state) until unmount
or the fs goes down, and clear I_DONTCACHE any time we notice bad
metadata.  Hopefully the incore inode then survives long enough that
anyone scanning for filesystem health status will still see the
badness state.

Therefore, we don't need the set_dontcache variable in this patch:

	/*
	 * If we're holding the only reference to an inode opened via
	 * handle, mark it dontcache so that we don't pollute the cache.
	 */
	if (handle_ip) {
		if (atomic_read(&VFS_I(handle_ip)->i_count) == 1)
			d_mark_dontcache(VFS_I(handle_ip));
		xfs_irele(handle_ip);
	}

> I also don't understand how d_mark_dontcache in
> xfs_ioctl_setattr_prepare_dax is supposed to work.  It'll make the
> inode go away quicker than without, but it can't force the inode out
> by itself.

That's correct.  You can set the ondisk fsdax iflag and then wait
centuries for the incore fsdax state to catch up.  I think this is a
very marginal design, but thankfully the intended design is that you
set daxinherit on the parent dir or mount with dax=always, and all new
files just come up with both fsdax flags set.

> I'm also lost on the interaction of that with the scrub inodes due to
> both above.  I'd still expect any scrub iget to set uncached for a
> cache miss.  If we then need to keep the inode around in transaction
> context we just keep it.  What is the benefit of playing racing games
> with i_count to delay setting the dontcache flag until irele?

One thing I don't like about XFS_IGET_DONTCACHE is that a concurrent
iget call without DONTCACHE won't clear the state from the inode,
which can lead to unnecessary evictions.  This racy thing means we
only set it if we think the inode isn't in use anywhere else.
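[Editor's note: the irele-time policy being defended above reduces to a
small decision function.  This is an illustrative standalone model --
the name is made up, not XFS code -- of "only tag an inode dontcache if
the scan was clean and nobody else holds it", which is exactly the
i_count == 1 check in the snippet.]

```c
#include <stdbool.h>

/*
 * Model of the release-time policy from this patch: mark the inode
 * dontcache only when the scan came back clean and we hold the sole
 * reference, so an inode also in use elsewhere (i_count > 1), or one
 * with known-bad metadata, is never tagged for early eviction.
 */
static bool model_mark_dontcache_at_irele(int i_count, bool scan_clean)
{
	return scan_clean && i_count == 1;
}
```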
At this point, though, I think we could add XFS_IGET_DONTCACHE to all
the iget calls and drop the irele dontcache thing.

> And why does the DAX mess matter for that?

fsdax doesn't matter; it merely slurped up the functionality and now
we're tangled up in it.

> Maybe I'm just thick and this is all obvious, but then it needs to be
> documented in detailed comments.

No, it's ... very twisty and weird.  I wish dontcache had remained our
private xfs thing.

--D
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index cab34823f3c91..f1a17f986b6cf 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -761,6 +761,31 @@ xfs_scrubv_previous_failures(
 	return false;
 }
 
+/*
+ * If the caller provided us with a nonzero inode number that isn't the ioctl
+ * file, try to grab a reference to it to eliminate all further untrusted inode
+ * lookups.  If we can't get the inode, let each scrub function try again.
+ */
+STATIC struct xfs_inode *
+xchk_scrubv_open_by_handle(
+	struct xfs_mount		*mp,
+	const struct xfs_scrub_vec_head	*vhead)
+{
+	struct xfs_inode		*ip;
+	int				error;
+
+	error = xfs_iget(mp, NULL, vhead->svh_ino, XFS_IGET_UNTRUSTED, 0, &ip);
+	if (error)
+		return NULL;
+
+	if (VFS_I(ip)->i_generation != vhead->svh_gen) {
+		xfs_irele(ip);
+		return NULL;
+	}
+
+	return ip;
+}
+
 /* Vectored scrub implementation to reduce ioctl calls. */
 int
 xfs_scrubv_metadata(
@@ -769,7 +794,9 @@ xfs_scrubv_metadata(
 {
 	struct xfs_inode		*ip_in = XFS_I(file_inode(file));
 	struct xfs_mount		*mp = ip_in->i_mount;
+	struct xfs_inode		*handle_ip = NULL;
 	struct xfs_scrub_vec		*v;
+	bool				set_dontcache = false;
 	unsigned int			i;
 	int				error = 0;
 
@@ -788,9 +815,28 @@ xfs_scrubv_metadata(
 		    (v->sv_flags & ~XFS_SCRUB_FLAGS_OUT))
 			return -EINVAL;
 
+		/*
+		 * If we detect at least one inode-type scrub, we might
+		 * consider setting dontcache at the end.
+		 */
+		if (v->sv_type < XFS_SCRUB_TYPE_NR &&
+		    meta_scrub_ops[v->sv_type].type == ST_INODE)
+			set_dontcache = true;
+
 		trace_xchk_scrubv_item(mp, vhead, v);
 	}
 
+	/*
+	 * If the caller wants us to do a scrub-by-handle and the file used to
+	 * call the ioctl is not the same file, load the incore inode and pin
+	 * it across all the scrubv actions to avoid repeated UNTRUSTED
+	 * lookups.  The reference is not passed to deeper layers of scrub
+	 * because each scrubber gets to decide its own strategy for getting an
+	 * inode.
+	 */
+	if (vhead->svh_ino && vhead->svh_ino != ip_in->i_ino)
+		handle_ip = xchk_scrubv_open_by_handle(mp, vhead);
+
 	/* Run all the scrubbers. */
 	for (i = 0, v = vhead->svh_vecs; i < vhead->svh_nr; i++, v++) {
 		struct xfs_scrub_metadata	sm = {
@@ -814,6 +860,10 @@ xfs_scrubv_metadata(
 		v->sv_ret = xfs_scrub_metadata(file, &sm);
 		v->sv_flags = sm.sm_flags;
 
+		/* Leave the inode in memory if something's wrong with it. */
+		if (xchk_needs_repair(&sm))
+			set_dontcache = false;
+
 		if (vhead->svh_rest_us) {
 			ktime_t		expires;
 
@@ -828,5 +878,16 @@ xfs_scrubv_metadata(
 		}
 	}
 
+	/*
+	 * If we're holding the only reference to an inode opened via handle
+	 * and the scan was clean, mark it dontcache so that we don't pollute
+	 * the cache.
+	 */
+	if (handle_ip) {
+		if (set_dontcache &&
+		    atomic_read(&VFS_I(handle_ip)->i_count) == 1)
+			d_mark_dontcache(VFS_I(handle_ip));
+		xfs_irele(handle_ip);
+	}
 	return error;
 }