diff mbox series

[1/3] xfs: manage inode DONTCACHE status at irele time

Message ID 166473482943.1084685.12751834399982118437.stgit@magnolia (mailing list archive)
State Superseded, archived
Headers show
Series xfs: fix iget/irele usage in online fsck | expand

Commit Message

Darrick J. Wong Oct. 2, 2022, 6:20 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Right now, there are statements scattered all over the online fsck
codebase about how we can't use XFS_IGET_DONTCACHE because of concerns
about scrub's unusual practice of releasing inodes with transactions
held.

However, iget is the wrong place to handle this -- the DONTCACHE state
doesn't matter at all until we try to *release* the inode, and here we
get things wrong in multiple ways:

First, if we /do/ have a transaction, we must NOT drop the inode,
because the inode could have dirty pages, dropping the inode will
trigger writeback, and writeback can trigger a nested transaction.

Second, if the inode already had an active reference and the DONTCACHE
flag set, the icache hit when scrub grabs another ref will not clear
DONTCACHE.  This is sort of by design, since DONTCACHE is now used to
initiate cache drops so that sysadmins can change a file's access mode
between pagecache and DAX.

Third, if we do actually have the last active reference to the inode, we
can set DONTCACHE to avoid polluting the cache.  This is the /one/ case
where we actually want that flag.

Create an xchk_irele helper to encode all that logic and switch the
online fsck code to use it.  Since this now means that nearly all
scrubbers use the same xfs_iget flags, we can wrap them too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c |   52 +++++++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/scrub/common.h |    3 +++
 fs/xfs/scrub/dir.c    |    2 +-
 fs/xfs/scrub/parent.c |    5 ++---
 fs/xfs/scrub/scrub.c  |    2 +-
 5 files changed, 55 insertions(+), 9 deletions(-)

Comments

Dave Chinner Nov. 15, 2022, 3:13 a.m. UTC | #1
On Sun, Oct 02, 2022 at 11:20:29AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Right now, there are statements scattered all over the online fsck
> codebase about how we can't use XFS_IGET_DONTCACHE because of concerns
> about scrub's unusual practice of releasing inodes with transactions
> held.
> 
> However, iget is the wrong place to handle this -- the DONTCACHE state
> doesn't matter at all until we try to *release* the inode, and here we
> get things wrong in multiple ways:
> 
> First, if we /do/ have a transaction, we must NOT drop the inode,
> because the inode could have dirty pages, dropping the inode will
> trigger writeback, and writeback can trigger a nested transaction.
> 
> Second, if the inode already had an active reference and the DONTCACHE
> flag set, the icache hit when scrub grabs another ref will not clear
> DONTCACHE.  This is sort of by design, since DONTCACHE is now used to
> initiate cache drops so that sysadmins can change a file's access mode
> between pagecache and DAX.
> 
> Third, if we do actually have the last active reference to the inode, we
> can set DONTCACHE to avoid polluting the cache.  This is the /one/ case
> where we actually want that flag.
> 
> Create an xchk_irele helper to encode all that logic and switch the
> online fsck code to use it.  Since this now means that nearly all
> scrubbers use the same xfs_iget flags, we can wrap them too.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Ok, I can see what needs to be done here. It seems a bit fragile,
but I don't see a better way at the moment.

That said...

> diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
> index ab182a5cd0c0..38ea04e66468 100644
> --- a/fs/xfs/scrub/parent.c
> +++ b/fs/xfs/scrub/parent.c
> @@ -131,7 +131,6 @@ xchk_parent_validate(
>  	xfs_ino_t		dnum,
>  	bool			*try_again)
>  {
> -	struct xfs_mount	*mp = sc->mp;
>  	struct xfs_inode	*dp = NULL;
>  	xfs_nlink_t		expected_nlink;
>  	xfs_nlink_t		nlink;
> @@ -168,7 +167,7 @@ xchk_parent_validate(
>  	 * -EFSCORRUPTED or -EFSBADCRC then the parent is corrupt which is a
>  	 *  cross referencing error.  Any other error is an operational error.
>  	 */
> -	error = xfs_iget(mp, sc->tp, dnum, XFS_IGET_UNTRUSTED, 0, &dp);
> +	error = xchk_iget(sc, dnum, &dp);
>  	if (error == -EINVAL || error == -ENOENT) {
>  		error = -EFSCORRUPTED;
>  		xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error);
> @@ -253,7 +252,7 @@ xchk_parent_validate(
>  out_unlock:
>  	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
>  out_rele:
> -	xfs_irele(dp);
> +	xchk_irele(sc, dp);
>  out:
>  	return error;
>  }

Didn't you miss a couple of cases here? THe current upstream code
looks like:

.......
237         /* Drat, parent changed.  Try again! */
238         if (dnum != dp->i_ino) {
239                 xfs_irele(dp);
240                 *try_again = true;
241                 return 0;
242         }
243         xfs_irele(dp);
244
245         /*
246          * '..' didn't change, so check that there was only one entry
247          * for us in the parent.
248          */
249         if (nlink != expected_nlink)
250                 xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
251         return error;
252
253 out_unlock:
254         xfs_iunlock(dp, XFS_IOLOCK_SHARED);
255 out_rele:
256         xfs_irele(dp);
257 out:
258         return error;
259 }

So it looks like you missed the conversion at lines 239 and 243. Of
course, these may have been removed in a prior patchset I've looked
at and forgotten about, but on the surface this looks like missed
conversions.

Cheers,

Dave.
Darrick J. Wong Nov. 15, 2022, 3:34 a.m. UTC | #2
On Tue, Nov 15, 2022 at 02:13:18PM +1100, Dave Chinner wrote:
> On Sun, Oct 02, 2022 at 11:20:29AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Right now, there are statements scattered all over the online fsck
> > codebase about how we can't use XFS_IGET_DONTCACHE because of concerns
> > about scrub's unusual practice of releasing inodes with transactions
> > held.
> > 
> > However, iget is the wrong place to handle this -- the DONTCACHE state
> > doesn't matter at all until we try to *release* the inode, and here we
> > get things wrong in multiple ways:
> > 
> > First, if we /do/ have a transaction, we must NOT drop the inode,
> > because the inode could have dirty pages, dropping the inode will
> > trigger writeback, and writeback can trigger a nested transaction.
> > 
> > Second, if the inode already had an active reference and the DONTCACHE
> > flag set, the icache hit when scrub grabs another ref will not clear
> > DONTCACHE.  This is sort of by design, since DONTCACHE is now used to
> > initiate cache drops so that sysadmins can change a file's access mode
> > between pagecache and DAX.
> > 
> > Third, if we do actually have the last active reference to the inode, we
> > can set DONTCACHE to avoid polluting the cache.  This is the /one/ case
> > where we actually want that flag.
> > 
> > Create an xchk_irele helper to encode all that logic and switch the
> > online fsck code to use it.  Since this now means that nearly all
> > scrubbers use the same xfs_iget flags, we can wrap them too.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> 
> Ok, I can see what needs to be done here. It seems a bit fragile,
> but I don't see a better way at the moment.
> 
> That said...
> 
> > diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
> > index ab182a5cd0c0..38ea04e66468 100644
> > --- a/fs/xfs/scrub/parent.c
> > +++ b/fs/xfs/scrub/parent.c
> > @@ -131,7 +131,6 @@ xchk_parent_validate(
> >  	xfs_ino_t		dnum,
> >  	bool			*try_again)
> >  {
> > -	struct xfs_mount	*mp = sc->mp;
> >  	struct xfs_inode	*dp = NULL;
> >  	xfs_nlink_t		expected_nlink;
> >  	xfs_nlink_t		nlink;
> > @@ -168,7 +167,7 @@ xchk_parent_validate(
> >  	 * -EFSCORRUPTED or -EFSBADCRC then the parent is corrupt which is a
> >  	 *  cross referencing error.  Any other error is an operational error.
> >  	 */
> > -	error = xfs_iget(mp, sc->tp, dnum, XFS_IGET_UNTRUSTED, 0, &dp);
> > +	error = xchk_iget(sc, dnum, &dp);
> >  	if (error == -EINVAL || error == -ENOENT) {
> >  		error = -EFSCORRUPTED;
> >  		xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error);
> > @@ -253,7 +252,7 @@ xchk_parent_validate(
> >  out_unlock:
> >  	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
> >  out_rele:
> > -	xfs_irele(dp);
> > +	xchk_irele(sc, dp);
> >  out:
> >  	return error;
> >  }
> 
> Didn't you miss a couple of cases here? THe current upstream code
> looks like:
> 
> .......
> 237         /* Drat, parent changed.  Try again! */
> 238         if (dnum != dp->i_ino) {
> 239                 xfs_irele(dp);
> 240                 *try_again = true;
> 241                 return 0;
> 242         }
> 243         xfs_irele(dp);
> 244
> 245         /*
> 246          * '..' didn't change, so check that there was only one entry
> 247          * for us in the parent.
> 248          */
> 249         if (nlink != expected_nlink)
> 250                 xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
> 251         return error;
> 252
> 253 out_unlock:
> 254         xfs_iunlock(dp, XFS_IOLOCK_SHARED);
> 255 out_rele:
> 256         xfs_irele(dp);
> 257 out:
> 258         return error;
> 259 }
> 
> So it looks like you missed the conversion at lines 239 and 243. Of
> course, these may have been removed in a prior patchset I've looked
> at and forgotten about, but on the surface this looks like missed
> conversions.

Actually, I probably missed it because one of the follow-on fixpatches
in the v23.1 patchbomb removes it entirely:
https://lore.kernel.org/linux-xfs/166473483278.1084804.14032671424392139245.stgit@magnolia/

--D

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
diff mbox series

Patch

diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 73ac38c1126e..42a25488bd25 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -710,6 +710,16 @@  xchk_checkpoint_log(
 	return 0;
 }
 
+/* Verify that an inode is allocated ondisk, then return its cached inode. */
+int
+xchk_iget(
+	struct xfs_scrub	*sc,
+	xfs_ino_t		inum,
+	struct xfs_inode	**ipp)
+{
+	return xfs_iget(sc->mp, sc->tp, inum, XFS_IGET_UNTRUSTED, 0, ipp);
+}
+
 /*
  * Given an inode and the scrub control structure, grab either the
  * inode referenced in the control structure or the inode passed in.
@@ -734,8 +744,7 @@  xchk_get_inode(
 	/* Look up the inode, see if the generation number matches. */
 	if (xfs_internal_inum(mp, sc->sm->sm_ino))
 		return -ENOENT;
-	error = xfs_iget(mp, NULL, sc->sm->sm_ino,
-			XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE, 0, &ip);
+	error = xchk_iget(sc, sc->sm->sm_ino, &ip);
 	switch (error) {
 	case -ENOENT:
 		/* Inode doesn't exist, just bail out. */
@@ -757,7 +766,7 @@  xchk_get_inode(
 		 * that it no longer exists.
 		 */
 		error = xfs_imap(sc->mp, sc->tp, sc->sm->sm_ino, &imap,
-				XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE);
+				XFS_IGET_UNTRUSTED);
 		if (error)
 			return -ENOENT;
 		error = -EFSCORRUPTED;
@@ -770,7 +779,7 @@  xchk_get_inode(
 		return error;
 	}
 	if (VFS_I(ip)->i_generation != sc->sm->sm_gen) {
-		xfs_irele(ip);
+		xchk_irele(sc, ip);
 		return -ENOENT;
 	}
 
@@ -778,6 +787,41 @@  xchk_get_inode(
 	return 0;
 }
 
+/* Release an inode, possibly dropping it in the process. */
+void
+xchk_irele(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip)
+{
+	if (current->journal_info != NULL) {
+		ASSERT(current->journal_info == sc->tp);
+
+		/*
+		 * If we are in a transaction, we /cannot/ drop the inode
+		 * ourselves, because the VFS will trigger writeback, which
+		 * can require a transaction.  Clear DONTCACHE to force the
+		 * inode to the LRU, where someone else can take care of
+		 * dropping it.
+		 *
+		 * Note that when we grabbed our reference to the inode, it
+		 * could have had an active ref and DONTCACHE set if a sysadmin
+		 * is trying to coerce a change in file access mode.  icache
+		 * hits do not clear DONTCACHE, so we must do it here.
+		 */
+		spin_lock(&VFS_I(ip)->i_lock);
+		VFS_I(ip)->i_state &= ~I_DONTCACHE;
+		spin_unlock(&VFS_I(ip)->i_lock);
+	} else if (atomic_read(&VFS_I(ip)->i_count) == 1) {
+		/*
+		 * If this is the last reference to the inode and the caller
+		 * permits it, set DONTCACHE to avoid thrashing.
+		 */
+		d_mark_dontcache(VFS_I(ip));
+	}
+
+	xfs_irele(ip);
+}
+
 /* Set us up to scrub a file's contents. */
 int
 xchk_setup_inode_contents(
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 0efe6b947d88..7472c41d9cfe 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -137,6 +137,9 @@  int xchk_get_inode(struct xfs_scrub *sc);
 int xchk_setup_inode_contents(struct xfs_scrub *sc, unsigned int resblks);
 void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp);
 
+int xchk_iget(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_inode **ipp);
+void xchk_irele(struct xfs_scrub *sc, struct xfs_inode *ip);
+
 /*
  * Don't bother cross-referencing if we already found corruption or cross
  * referencing discrepancies.
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 58b9761db48d..43a1cbf2ac67 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -86,7 +86,7 @@  xchk_dir_check_ftype(
 			xfs_mode_to_ftype(VFS_I(ip)->i_mode));
 	if (ino_dtype != dtype)
 		xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
-	xfs_irele(ip);
+	xchk_irele(sdc->sc, ip);
 out:
 	return error;
 }
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index ab182a5cd0c0..38ea04e66468 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -131,7 +131,6 @@  xchk_parent_validate(
 	xfs_ino_t		dnum,
 	bool			*try_again)
 {
-	struct xfs_mount	*mp = sc->mp;
 	struct xfs_inode	*dp = NULL;
 	xfs_nlink_t		expected_nlink;
 	xfs_nlink_t		nlink;
@@ -168,7 +167,7 @@  xchk_parent_validate(
 	 * -EFSCORRUPTED or -EFSBADCRC then the parent is corrupt which is a
 	 *  cross referencing error.  Any other error is an operational error.
 	 */
-	error = xfs_iget(mp, sc->tp, dnum, XFS_IGET_UNTRUSTED, 0, &dp);
+	error = xchk_iget(sc, dnum, &dp);
 	if (error == -EINVAL || error == -ENOENT) {
 		error = -EFSCORRUPTED;
 		xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error);
@@ -253,7 +252,7 @@  xchk_parent_validate(
 out_unlock:
 	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
 out_rele:
-	xfs_irele(dp);
+	xchk_irele(sc, dp);
 out:
 	return error;
 }
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 7a3557a69fe0..bc9638c7a379 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -181,7 +181,7 @@  xchk_teardown(
 			xfs_iunlock(sc->ip, sc->ilock_flags);
 		if (sc->ip != ip_in &&
 		    !xfs_internal_inum(sc->mp, sc->ip->i_ino))
-			xfs_irele(sc->ip);
+			xchk_irele(sc, sc->ip);
 		sc->ip = NULL;
 	}
 	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)