[3/3] xfs: only iget the file once when doing vectored scrub-by-handle

Message ID: 171270972068.3634974.15204601732623547015.stgit@frogsfrogsfrogs
State: Superseded
Series: [1/3] xfs: reduce the rate of cond_resched calls inside scrub

Commit Message

Darrick J. Wong April 10, 2024, 1:09 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

If a program wants us to perform a scrub on a file handle and the fd
passed to ioctl() is not the file referenced in the handle, iget the
file once and pass it into the scrub code.  This amortizes the untrusted
iget lookup over /all/ the scrubbers mentioned in the scrubv call.

When running fstests in "rebuild all metadata after each test" mode, I
observed a 10% reduction in runtime on account of avoiding repeated
inobt lookups.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/scrub.c |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)
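
For illustration, a scrub-by-handle caller might drive this path roughly
as sketched below.  This is only a sketch: it assumes svh_vecs is a
flexible array member and that the ioctl is named
XFS_IOC_SCRUBV_METADATA, both details taken from this patch series and
subject to change before it lands; target_ino and target_gen are
placeholders for values obtained from bulkstat or a file handle.

	struct xfs_scrub_vec_head	*vhead;
	int				ret;

	vhead = calloc(1, sizeof(*vhead) + 2 * sizeof(struct xfs_scrub_vec));
	vhead->svh_ino = target_ino;	/* placeholder: inode to scrub */
	vhead->svh_gen = target_gen;	/* checked against i_generation */
	vhead->svh_nr = 2;
	vhead->svh_vecs[0].sv_type = XFS_SCRUB_TYPE_INODE;
	vhead->svh_vecs[1].sv_type = XFS_SCRUB_TYPE_BMBTD;

	/* fd may be any open file on the same filesystem. */
	ret = ioctl(fd, XFS_IOC_SCRUBV_METADATA, vhead);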

Comments

Christoph Hellwig April 10, 2024, 3:12 p.m. UTC | #1
> +	/*
> +	 * If the caller wants us to do a scrub-by-handle and the file used to
> +	 * call the ioctl is not the same file, load the incore inode and pin
> +	 * it across all the scrubv actions to avoid repeated UNTRUSTED
> +	 * lookups.  The reference is not passed to deeper layers of scrub
> +	 * because each scrubber gets to decide its own strategy for getting an
> +	 * inode.
> +	 */
> +	if (vhead->svh_ino && vhead->svh_ino != ip_in->i_ino)
> +		handle_ip = xchk_scrubv_open_by_handle(mp, vhead);

Oh.  So we read the inode, keep a reference to it, but still hit the
inode cache every time.  A little non-obvious and not perfect for
performance, but based on your numbers probably good enough.

Curious: what is the reason the scrubbers want/need different ways to
get at the inode?

> +	/*
> +	 * If we're holding the only reference to an inode opened via handle
> +	 * and the scan was clean, mark it dontcache so that we don't pollute
> +	 * the cache.
> +	 */
> +	if (handle_ip) {
> +		if (set_dontcache &&
> +		    atomic_read(&VFS_I(handle_ip)->i_count) == 1)
> +			d_mark_dontcache(VFS_I(handle_ip));
> +		xfs_irele(handle_ip);
> +	}

This looks a little weird to me.  Can't we simply use XFS_IGET_DONTCACHE
at iget time and then clear I_DONTCACHE here if we want to keep the
inode around?  Given that we only set the uncached flag from
XFS_IGET_DONTCACHE on a cache miss, we won't have set
DCACHE_DONTCACHE anywhere (and don't really care about the dentries to
start with).

But why do we care about keeping the inodes with errors in memory
here, but not elsewhere?  Maybe this can be explained in an expanded
comment.
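
Concretely, the suggested alternative might look like the sketch below,
where keep_incore is a placeholder for whatever "keep this inode around"
condition applies; the clearing step mirrors what the scrub code already
does when it wants an inode to stay cached:

	/* At lookup time: request DONTCACHE on a cache miss. */
	error = xfs_iget(mp, NULL, vhead->svh_ino,
			XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE, 0, &ip);

	/* At release time: clear it if the inode should stay incore. */
	if (keep_incore) {
		spin_lock(&VFS_I(ip)->i_lock);
		VFS_I(ip)->i_state &= ~I_DONTCACHE;
		spin_unlock(&VFS_I(ip)->i_lock);
	}
	xfs_irele(ip);
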
Darrick J. Wong April 11, 2024, 1:15 a.m. UTC | #2
On Wed, Apr 10, 2024 at 08:12:16AM -0700, Christoph Hellwig wrote:
> > +	/*
> > +	 * If the caller wants us to do a scrub-by-handle and the file used to
> > +	 * call the ioctl is not the same file, load the incore inode and pin
> > +	 * it across all the scrubv actions to avoid repeated UNTRUSTED
> > +	 * lookups.  The reference is not passed to deeper layers of scrub
> > +	 * because each scrubber gets to decide its own strategy for getting an
> > +	 * inode.
> > +	 */
> > +	if (vhead->svh_ino && vhead->svh_ino != ip_in->i_ino)
> > +		handle_ip = xchk_scrubv_open_by_handle(mp, vhead);
> 
> Oh.  So we read the inode, keep a reference to it, but still hit the
> inode cache every time.  A little non-obvious and not perfect for
> performance, but based on your numbers probably good enough.
> 
> Curious: what is the reason the scrubbers want/need different ways to
> get at the inode?

I don't remember the exact reason why we don't pass this ip into
xfs_scrub_metadata, but iirc the inode scrub setup functions react
differently (from the bmap/dir/attr/symlink scrubbers) when iget
failures occur.

Also this way xfs_scrub_metadata owns the refcount to whatever inode it
picks up, and can do whatever it wants with that reference.

> > +	/*
> > +	 * If we're holding the only reference to an inode opened via handle
> > +	 * and the scan was clean, mark it dontcache so that we don't pollute
> > +	 * the cache.
> > +	 */
> > +	if (handle_ip) {
> > +		if (set_dontcache &&
> > +		    atomic_read(&VFS_I(handle_ip)->i_count) == 1)
> > +			d_mark_dontcache(VFS_I(handle_ip));
> > +		xfs_irele(handle_ip);
> > +	}
> 
> This looks a little weird to me.  Can't we simply use XFS_IGET_DONTCACHE
> at iget time and then clear I_DONTCACHE here if we want to keep the
> inode around?

Not anymore, because other threads can mess around with the dontcache
state (yay fsdax access path changes!!) while we are scrubbing the
inode.

>                Given that we only set the uncached flag from
> XFS_IGET_DONTCACHE on a cache miss, we won't have set
> DCACHE_DONTCACHE anywhere (and don't really care about the dentries to
> start with).
> 
> But why do we care about keeping the inodes with errors in memory
> here, but not elsewhere?

We actually do, but it's not obvious...

> Maybe this can be explained in an expanded comment.

...because this bit here is basically the same as xchk_irele, but we
don't have a xfs_scrub object to pass in, so it's opencoded.  I could
pull this logic out into:

void xfs_scrub_irele(struct xfs_inode *ip)
{
	if (atomic_read(&VFS_I(ip)->i_count) == 1) {
		/*
		 * If this is the last reference to the inode, set
		 * DONTCACHE to avoid thrashing the inode cache.
		 */
		d_mark_dontcache(VFS_I(ip));
	}
	xfs_irele(ip);
}

and change xchk_irele to:

void
xchk_irele(
	struct xfs_scrub	*sc,
	struct xfs_inode	*ip)
{
	if (sc->tp) {
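		/*
		 * Can't let go of the inode from transaction context:
		 * inode eviction can itself require a transaction, so
		 * clear DONTCACHE and let reclaim drop it for us later.
		 */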
		spin_lock(&VFS_I(ip)->i_lock);
		VFS_I(ip)->i_state &= ~I_DONTCACHE;
		spin_unlock(&VFS_I(ip)->i_lock);
		xfs_irele(ip);
		return;
	}

	xfs_scrub_irele(ip);
}
Christoph Hellwig April 11, 2024, 3:49 a.m. UTC | #3
On Wed, Apr 10, 2024 at 06:15:02PM -0700, Darrick J. Wong wrote:
> > This looks a little weird to me.  Can't we simply use XFS_IGET_DONTCACHE
> > at iget time and then clear I_DONTCACHE here if we want to keep the
> > inode around?
> 
> Not anymore, because other threads can mess around with the dontcache
> state (yay fsdax access path changes!!) while we are scrubbing the
> inode.

You mean xfs_ioctl_setattr_prepare_dax?  Oh lovely, a completely
undocumented d_mark_dontcache in a completely non-obvious place.

It seems to have appeared in
commit e4f9ba20d3b8c2b86ec71f326882e1a3c4e47953
Author: Ira Weiny <ira.weiny@intel.com>
Date:   Thu Apr 30 07:41:38 2020 -0700

    fs/xfs: Update xfs_ioctl_setattr_dax_invalidate()

without any explanation either.  And I can't see any reason why
we'd prevent inodes and dentries from being cached after DAX mode
switches to start with.  I can only guess, maybe the commit thinks
d_mark_dontcache is about data caching?

> 
> >                Given that we only set the uncached flag from
> > XFS_IGET_DONTCACHE on a cache miss, we won't have set
> > DCACHE_DONTCACHE anywhere (and don't really care about the dentries to
> > start with).
> > 
> > But why do we care about keeping the inodes with errors in memory
> > here, but not elsewhere?
> 
> We actually do, but it's not obvious...
> 
> > Maybe this can be explained in an expanded comment.
> 
> ...because this bit here is basically the same as xchk_irele, but we
> don't have a xfs_scrub object to pass in, so it's opencoded.  I could
> pull this logic out into:

Eww, I hadn't seen xchk_irele before.  To me it looks like
I_DONTCACHE/d_mark_dontcache is really the wrong vehicle here.

I'd instead have a XFS_IGET_SCRUB, which will set an XFS_ISCRUB or
whatever flag on a cache miss.  Any cache hit without XFS_IGET_SCRUB
will clear it.

->drop_inode then always returns true for XFS_ISCRUB inodes unless
in a transaction.  Talking about the in transaction part - why do
we drop inodes in the transaction in scrub, but not elsewhere?
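
Sketched out, with the caveat that XFS_ISCRUB is hypothetical and the
in-transaction carve-out is left aside (detecting transaction context is
exactly the question raised above):

static int
xfs_fs_drop_inode(
	struct inode		*inode)
{
	struct xfs_inode	*ip = XFS_I(inode);

	/* Evict inodes that only scrub ever looked at. */
	if (xfs_iflags_test(ip, XFS_ISCRUB))
		return 1;

	return generic_drop_inode(inode);
}
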
Darrick J. Wong April 11, 2024, 4:41 a.m. UTC | #4
On Wed, Apr 10, 2024 at 08:49:39PM -0700, Christoph Hellwig wrote:
> On Wed, Apr 10, 2024 at 06:15:02PM -0700, Darrick J. Wong wrote:
> > > This looks a little weird to me.  Can't we simply use XFS_IGET_DONTCACHE
> > > at iget time and then clear I_DONTCACHE here if we want to keep the
> > > inode around?
> > 
> > Not anymore, because other threads can mess around with the dontcache
> > state (yay fsdax access path changes!!) while we are scrubbing the
> > inode.
> 
> You mean xfs_ioctl_setattr_prepare_dax?  Oh lovely, a completely
> undocumented d_mark_dontcache in a completely non-obvious place.
> 
> It seems to have appeared in
> commit e4f9ba20d3b8c2b86ec71f326882e1a3c4e47953
> Author: Ira Weiny <ira.weiny@intel.com>
> Date:   Thu Apr 30 07:41:38 2020 -0700
> 
>     fs/xfs: Update xfs_ioctl_setattr_dax_invalidate()
> 
> without any explanation either.  And I can't see any reason why
> we'd prevent inodes and dentries from being cached after DAX mode
> switches to start with.  I can only guess, maybe the commit thinks
> d_mark_dontcache is about data caching?

It's the horrible way that fsdax "supports" switching the address ops
and i_mapping contents at runtime -- set the ondisk iflag, mark the
inode/dentry for immediate expulsion, wait for reclaim to eat the inode,
then reload it and *presto* new incore iflag and state!

(It's gross but I don't know of a better way to drain i_mapping and
change address ops and at this point I'm hoping I just plain forget all
that pmem stuff. :P)
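
For reference, the d_mark_dontcache in question is roughly the following
(condensed from xfs_ioctl_setattr_prepare_dax; the exact guards vary by
kernel version):

static void
xfs_ioctl_setattr_prepare_dax(
	struct xfs_inode	*ip,
	struct fileattr		*fa)
{
	struct inode		*inode = VFS_I(ip);

	if (S_ISDIR(inode->i_mode))
		return;

	/* The DAX state would change: evict so the inode reloads fresh. */
	if (!!(fa->fsx_xflags & FS_XFLAG_DAX) !=
	    !!(ip->i_diflags2 & XFS_DIFLAG2_DAX))
		d_mark_dontcache(inode);
}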

> > 
> > >                Given that we only set the uncached flag from
> > > XFS_IGET_DONTCACHE on a cache miss, we won't have set
> > > DCACHE_DONTCACHE anywhere (and don't really care about the dentries to
> > > start with).
> > > 
> > > But why do we care about keeping the inodes with errors in memory
> > > here, but not elsewhere?
> > 
> > We actually do, but it's not obvious...
> > 
> > > Maybe this can be explained in an expanded comment.
> > 
> > ...because this bit here is basically the same as xchk_irele, but we
> > don't have a xfs_scrub object to pass in, so it's opencoded.  I could
> > pull this logic out into:
> 
> Eww, I hadn't seen xchk_irele before.  To me it looks like
> I_DONTCACHE/d_mark_dontcache is really the wrong vehicle here.
> 
> I'd instead have a XFS_IGET_SCRUB, which will set an XFS_ISCRUB or
> whatever flag on a cache miss.  Any cache hit without XFS_IGET_SCRUB
> will clear it.
> 
> ->drop_inode then always returns true for XFS_ISCRUB inodes unless
> in a transaction.

How does it determine that we're in a transaction?  We just stopped
storing transactions in current->journal_info due to problems with
nested transactions and ext4 assuming that it can blindly deref it.

>                    Talking about the in transaction part - why do
> we drop inodes in the transaction in scrub, but not elsewhere?

One example is:

Alloc transaction -> lock rmap btree for repairs -> iscan filesystem to
find rmap records -> iget/irele.

--D
Christoph Hellwig April 11, 2024, 4:52 a.m. UTC | #5
> How does it determine that we're in a transaction?  We just stopped
> storing transactions in current->journal_info due to problems with
> nested transactions and ext4 assuming that it can blindly deref it.

Oh, I was looking at an old tree and missed that.

Well, someone needs to own it, it's just not just ext4 but could us.
> 
> >                    Talking about the in transaction part - why do
> > we drop inodes in the transaction in scrub, but not elsewhere?
> 
> One example is:
> 
> Alloc transaction -> lock rmap btree for repairs -> iscan filesystem to
> find rmap records -> iget/irele.

So this is just the magic empty transaction?
Darrick J. Wong April 11, 2024, 4:56 a.m. UTC | #6
On Wed, Apr 10, 2024 at 09:52:41PM -0700, Christoph Hellwig wrote:
> > How does it determine that we're in a transaction?  We just stopped
> > storing transactions in current->journal_info due to problems with
> > nested transactions and ext4 assuming that it can blindly deref it.
> 
> Oh, I was looking at an old tree and missed that.

It's not in my tree but I did ... oh crap that already got committed; I
need to rip out that part of xrep_trans_alloc_hook_dummy now.

> Well, someone needs to own it, it's just not just ext4 but could us.

Er... I don't understand this?        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> > >                    Talking about the in transaction part - why do
> > > we drop inodes in the transaction in scrub, but not elsewhere?
> > 
> > One example is:
> > 
> > Alloc transaction -> lock rmap btree for repairs -> iscan filesystem to
> > find rmap records -> iget/irele.
> 
> So this is just the magic empty transaction?

No, that's the fully featured repair transaction that will eventually be
used to write/commit the new rmap tree.

--D
Christoph Hellwig April 11, 2024, 5:02 a.m. UTC | #7
On Wed, Apr 10, 2024 at 09:56:45PM -0700, Darrick J. Wong wrote:
> > Well, someone needs to own it, it's just not just ext4 but could us.
> 
> Er... I don't understand this?        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If we set current->journal and take a page fault we could not just
recurse into ext4 but into any fs including XFS.  And everyone
blindly dereferences it, as only one fs can own it.

> > > Alloc transaction -> lock rmap btree for repairs -> iscan filesystem to
> > > find rmap records -> iget/irele.
> > 
> > So this is just the magic empty transaction?
> 
> No, that's the fully featured repair transaction that will eventually be
> used to write/commit the new rmap tree.

That seems a bit dangerous to me.  I guess we rely on the code inside
the transaction context to never race with unmount as lack of SB_ACTIVE
will make the VFS ignore the dontcache flag.
Darrick J. Wong April 11, 2024, 5:21 a.m. UTC | #8
On Wed, Apr 10, 2024 at 10:02:23PM -0700, Christoph Hellwig wrote:
> On Wed, Apr 10, 2024 at 09:56:45PM -0700, Darrick J. Wong wrote:
> > > Well, someone needs to own it, it's just not just ext4 but could us.
> > 
> > Er... I don't understand this?        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> If we set current->journal and take a page fault we could not just
> recurse into ext4 but into any fs including XFS.  And everyone
> blindly dereferences it, as only one fs can own it.

Well back before we ripped it out I had said that XFS should just set
current->journal to 1 to prevent memory corruption but then Jan Kara
noted that ext4 changes its behavior wrt jbd2 if it sees nonzero
current->journal.  That's why Dave dropped it entirely.

> > > > Alloc transaction -> lock rmap btree for repairs -> iscan filesystem to
> > > > find rmap records -> iget/irele.
> > > 
> > > So this is just the magic empty transaction?
> > 
> > No, that's the fully featured repair transaction that will eventually be
> > used to write/commit the new rmap tree.
> 
> That seems a bit dangerous to me.  I guess we rely on the code inside
> the transaction context to never race with unmount as lack of SB_ACTIVE
> will make the VFS ignore the dontcache flag.

That and we have an open fd to call the ioctl so any unmount will fail,
and we can't enter scrub if unmount has already started.

--D
Christoph Hellwig April 11, 2024, 2:02 p.m. UTC | #9
On Wed, Apr 10, 2024 at 10:21:07PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 10, 2024 at 10:02:23PM -0700, Christoph Hellwig wrote:
> > On Wed, Apr 10, 2024 at 09:56:45PM -0700, Darrick J. Wong wrote:
> > > > Well, someone needs to own it, it's just not just ext4 but could us.
> > > 
> > > Er... I don't understand this?        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > If we set current->journal and take a page fault we could not just
> > recurse into ext4 but into any fs including XFS.  And everyone
> > blindly dereferences it, as only one fs can own it.
> 
> Well back before we ripped it out I had said that XFS should just set
> current->journal to 1 to prevent memory corruption but then Jan Kara
> noted that ext4 changes its behavior wrt jbd2 if it sees nonzero
> current->journal.  That's why Dave dropped it entirely.

If you are in a fs context you own current->journal_info.  But you
also must make sure not to copy from, and especially to, userspace, so
as not to recurse into another file system.  A per-thread field can't
work any other way.  So what ext4 is doing here is perfectly fine.
What XFS did was to set current->journal_info and then cause page
faults, which is not ok.  I'm glad we fixed it.

> > That seems a bit dangerous to me.  I guess we rely on the code inside
> > the transaction context to never race with unmount as lack of SB_ACTIVE
> > will make the VFS ignore the dontcache flag.
> 
> That and we have an open fd to call the ioctl so any unmount will fail,
> and we can't enter scrub if unmount has already started.

Indeed.

So I'm still confused on why this new code keeps the inode around if an
error happened, but xchk_irele does not.  What is the benefit of keeping
the inode around here?  Why does it not apply to xchk_irele?

I also don't understand how d_mark_dontcache in
xfs_ioctl_setattr_prepare_dax is supposed to work.  It'll make the inode
go away quicker than without, but it can't force the inode out by itself.

I'm also lost on the interaction of that with the scrub inodes due to
both above.  I'd still expect any scrub iget to set uncached for
a cache miss.  If we then need to keep the inode around in transaction
context we just keep it.  What is the benefit of playing racy
games with i_count to delay setting the dontcache flag until irele?
And why does the DAX mess matter for that?

Maybe I'm just thick and this is all obvious, but then it needs to
be documented in detailed comments.
Darrick J. Wong April 12, 2024, 12:21 a.m. UTC | #10
On Thu, Apr 11, 2024 at 07:02:07AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 10, 2024 at 10:21:07PM -0700, Darrick J. Wong wrote:
> > On Wed, Apr 10, 2024 at 10:02:23PM -0700, Christoph Hellwig wrote:
> > > On Wed, Apr 10, 2024 at 09:56:45PM -0700, Darrick J. Wong wrote:
> > > > > Well, someone needs to own it, it's just not just ext4 but could us.
> > > > 
> > > > Er... I don't understand this?        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > 
> > > If we set current->journal and take a page fault we could not just
> > > recurse into ext4 but into any fs including XFS.  And everyone
> > > blindly dereferences it, as only one fs can own it.
> > 
> > Well back before we ripped it out I had said that XFS should just set
> > current->journal to 1 to prevent memory corruption but then Jan Kara
> > noted that ext4 changes its behavior wrt jbd2 if it sees nonzero
> > current->journal.  That's why Dave dropped it entirely.
> 
> If you are in a fs context you own current->journal_info.  But you
> also must make sure not to copy from, and especially to, userspace, so
> as not to recurse into another file system.  A per-thread field can't
> work any other way.  So what ext4 is doing here is perfectly fine.
> What XFS did was to set current->journal_info and then cause page
> faults, which is not ok.  I'm glad we fixed it.
> 
> > > That seems a bit dangerous to me.  I guess we rely on the code inside
> > > the transaction context to never race with unmount as lack of SB_ACTIVE
> > > will make the VFS ignore the dontcache flag.
> > 
> > That and we have an open fd to call the ioctl so any unmount will fail,
> > and we can't enter scrub if unmount has already started.
> 
> Indeed.
> 
> So I'm still confused on why this new code keeps the inode around if an
> error happened, but xchk_irele does not.  What is the benefit of keeping
> the inode around here?  Why does it not apply to xchk_irele?

OH!  Crap, I forgot that some years ago (after the creation of the
vectorized scrub patch) I cleaned up that behavior -- previously scrub
actually did play games with clearing dontcache if the inode was sick.

Then Dave pointed out that we could just change reclaim not to purge the
incore inode (and hence preserve the health state) until unmount or the
fs goes down, and clear I_DONTCACHE any time we notice bad metadata.
Hopefully the incore inode then survives long enough that anyone
scanning for filesystem health status will still see the badness state.
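
That "keep sick inodes incore" behavior lives in the inode health code;
condensed from xfs_inode_mark_sick (details may differ across kernel
versions):

void
xfs_inode_mark_sick(
	struct xfs_inode	*ip,
	unsigned int		mask)
{
	spin_lock(&ip->i_flags_lock);
	ip->i_sick |= mask;
	ip->i_checked |= mask;
	spin_unlock(&ip->i_flags_lock);

	/* Keep the inode cached so the sickness report survives reclaim. */
	spin_lock(&VFS_I(ip)->i_lock);
	VFS_I(ip)->i_state &= ~I_DONTCACHE;
	spin_unlock(&VFS_I(ip)->i_lock);
}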

Therefore, we don't need the set_dontcache variable in this patch:

	/*
	 * If we're holding the only reference to an inode opened via
	 * handle, mark it dontcache so that we don't pollute the cache.
	 */
	if (handle_ip) {
		if (atomic_read(&VFS_I(handle_ip)->i_count) == 1)
			d_mark_dontcache(VFS_I(handle_ip));
		xfs_irele(handle_ip);
	}

> I also don't understand how d_mark_dontcache in
> xfs_ioctl_setattr_prepare_dax is supposed to work.  It'll make the inode
> go away quicker than without, but it can't force the inode out by itself.

That's correct.  You can set the ondisk fsdax iflag and then wait
centuries for the incore fsdax to catch up.  I think this is a very
marginal design, but thankfully the intended design is that you set
daxinherit on the parent dir or mount with dax=always and all new files
just come up with both fsdax flags set.

> I'm also lost on the interaction of that with the scrub inodes due to
> both above.  I'd still expect any scrub iget to set uncached for
> a cache miss.  If we then need to keep the inode around in transaction
> context we just keep it.  What is the benefit of playing racy
> games with i_count to delay setting the dontcache flag until irele?

One thing I don't like about XFS_IGET_DONTCACHE is that a concurrent
iget call without DONTCACHE won't clear the state from the inode, which
can lead to unnecessary evictions.  This racy thing means we only set it
if we think the inode isn't in use anywhere else.

At this point, though, I think we could add XFS_IGET_DONTCACHE to all
the iget calls and drop the irele dontcache thing.

> And why does the DAX mess matter for that?

fsdax doesn't matter, it merely slurped up the functionality and now
we're tangled up in it.

> Maybe I'm just thick and this is all obvious, but then it needs to
> be documented in detailed comments.

No, it's ... very twisty and weird.  I wish dontcache had remained our
private xfs thing.

--D

Patch

diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index cab34823f3c91..f1a17f986b6cf 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -761,6 +761,31 @@  xfs_scrubv_previous_failures(
 	return false;
 }
 
+/*
+ * If the caller provided us with a nonzero inode number that isn't the ioctl
+ * file, try to grab a reference to it to eliminate all further untrusted inode
+ * lookups.  If we can't get the inode, let each scrub function try again.
+ */
+STATIC struct xfs_inode *
+xchk_scrubv_open_by_handle(
+	struct xfs_mount		*mp,
+	const struct xfs_scrub_vec_head	*vhead)
+{
+	struct xfs_inode		*ip;
+	int				error;
+
+	error = xfs_iget(mp, NULL, vhead->svh_ino, XFS_IGET_UNTRUSTED, 0, &ip);
+	if (error)
+		return NULL;
+
+	if (VFS_I(ip)->i_generation != vhead->svh_gen) {
+		xfs_irele(ip);
+		return NULL;
+	}
+
+	return ip;
+}
+
 /* Vectored scrub implementation to reduce ioctl calls. */
 int
 xfs_scrubv_metadata(
@@ -769,7 +794,9 @@  xfs_scrubv_metadata(
 {
 	struct xfs_inode		*ip_in = XFS_I(file_inode(file));
 	struct xfs_mount		*mp = ip_in->i_mount;
+	struct xfs_inode		*handle_ip = NULL;
 	struct xfs_scrub_vec		*v;
+	bool				set_dontcache = false;
 	unsigned int			i;
 	int				error = 0;
 
@@ -788,9 +815,28 @@  xfs_scrubv_metadata(
 		    (v->sv_flags & ~XFS_SCRUB_FLAGS_OUT))
 			return -EINVAL;
 
+		/*
+		 * If we detect at least one inode-type scrub, we might
+		 * consider setting dontcache at the end.
+		 */
+		if (v->sv_type < XFS_SCRUB_TYPE_NR &&
+		    meta_scrub_ops[v->sv_type].type == ST_INODE)
+			set_dontcache = true;
+
 		trace_xchk_scrubv_item(mp, vhead, v);
 	}
 
+	/*
+	 * If the caller wants us to do a scrub-by-handle and the file used to
+	 * call the ioctl is not the same file, load the incore inode and pin
+	 * it across all the scrubv actions to avoid repeated UNTRUSTED
+	 * lookups.  The reference is not passed to deeper layers of scrub
+	 * because each scrubber gets to decide its own strategy for getting an
+	 * inode.
+	 */
+	if (vhead->svh_ino && vhead->svh_ino != ip_in->i_ino)
+		handle_ip = xchk_scrubv_open_by_handle(mp, vhead);
+
 	/* Run all the scrubbers. */
 	for (i = 0, v = vhead->svh_vecs; i < vhead->svh_nr; i++, v++) {
 		struct xfs_scrub_metadata	sm = {
@@ -814,6 +860,10 @@  xfs_scrubv_metadata(
 		v->sv_ret = xfs_scrub_metadata(file, &sm);
 		v->sv_flags = sm.sm_flags;
 
+		/* Leave the inode in memory if something's wrong with it. */
+		if (xchk_needs_repair(&sm))
+			set_dontcache = false;
+
 		if (vhead->svh_rest_us) {
 			ktime_t		expires;
 
@@ -828,5 +878,16 @@  xfs_scrubv_metadata(
 		}
 	}
 
+	/*
+	 * If we're holding the only reference to an inode opened via handle
+	 * and the scan was clean, mark it dontcache so that we don't pollute
+	 * the cache.
+	 */
+	if (handle_ip) {
+		if (set_dontcache &&
+		    atomic_read(&VFS_I(handle_ip)->i_count) == 1)
+			d_mark_dontcache(VFS_I(handle_ip));
+		xfs_irele(handle_ip);
+	}
 	return error;
 }