
[1/3] xfs: only reset incore inode health state flags when reclaiming an inode

Message ID 162268996135.2724138.14276025100886638786.stgit@locust (mailing list archive)
State Superseded
Series xfs: preserve inode health reports for longer

Commit Message

Darrick J. Wong June 3, 2021, 3:12 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

While running some fuzz tests on inode metadata, I noticed that the
filesystem health report (as provided by xfs_spaceman) failed to report
the file corruption even when spaceman was run immediately after running
xfs_scrub to detect the corruption.  That isn't the intended behavior;
one ought to be able to run scrub to detect errors in the ondisk
metadata and be able to access those reports for some time after the
scrub.

After running the same sequence through an instrumented kernel, I
discovered the reason why -- scrub igets the file, scans it, marks it
sick, and ireleases the inode.  When the VFS lets go of the incore
inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
inode before it moves to RECLAIM state, iget reinitializes the VFS
state, clears the sick and checked masks, and hands back the inode.  At
this point, the caller has the exact same incore inode, but with all the
health state erased.

In other words, we're erasing the incore inode's health state flags when
we've decided NOT to sever the link between the incore inode and the
ondisk inode.  This is wrong, so we need to remove the lines that zero
the fields from xfs_iget_cache_hit.

As a precaution, we add the same lines into xfs_reclaim_inode just after
we sever the link between incore and ondisk inode.  Strictly speaking
this isn't necessary because once an inode has gone through reclaim it
must go through xfs_inode_alloc (which also zeroes the state) and
xfs_iget is careful to check for mismatches between the inode it pulls
out of the radix tree and the one it wants.

Fixes: 6772c1f11206 ("xfs: track metadata health status")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
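
To make the failure mode described above concrete, here is a minimal
user-space sketch of the lifecycle (illustrative only; the structure and
function names are hypothetical stand-ins for the incore inode's
i_sick/i_checked fields and the iget cache-hit and reclaim paths, not
the actual kernel code):

/*
 * Illustrative sketch only -- not kernel code.  The names below model
 * ip->i_sick / ip->i_checked and the iget cache-hit and reclaim paths.
 */
#include <stdio.h>

#define SICK_CORE	(1U << 0)	/* stand-in for one health flag */

struct toy_inode {
	unsigned int	sick;		/* models ip->i_sick */
	unsigned int	checked;	/* models ip->i_checked */
};

/* scrub examines the inode and records that its metadata is bad */
static void scrub_mark_sick(struct toy_inode *ip)
{
	ip->checked |= SICK_CORE;
	ip->sick |= SICK_CORE;
}

/* old cache-hit path: reinitialize state, including the health masks */
static void iget_cache_hit_old(struct toy_inode *ip)
{
	ip->sick = 0;
	ip->checked = 0;
}

/* new cache-hit path: the incore/ondisk link survives, keep the masks */
static void iget_cache_hit_new(struct toy_inode *ip)
{
	(void)ip;	/* nothing to clear */
}

/* reclaim severs the incore/ondisk link, so the state is reset here */
static void reclaim_inode(struct toy_inode *ip)
{
	ip->sick = 0;
	ip->checked = 0;
}

int main(void)
{
	struct toy_inode a = { 0, 0 }, b = { 0, 0 };

	scrub_mark_sick(&a);
	iget_cache_hit_old(&a);		/* spaceman's iget hits the cache */
	printf("old behavior: sick=0x%x (report lost)\n", a.sick);

	scrub_mark_sick(&b);
	iget_cache_hit_new(&b);
	printf("new behavior: sick=0x%x (report preserved)\n", b.sick);

	reclaim_inode(&b);		/* only reclaim clears it */
	printf("after reclaim: sick=0x%x\n", b.sick);
	return 0;
}

With the change applied, only the reclaim path clears the masks, so a
second iget of a still-cached inode hands the health report back intact;
the report is lost only once the incore inode is actually reclaimed.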

Comments

Dave Chinner June 3, 2021, 4:21 a.m. UTC | #1
On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> While running some fuzz tests on inode metadata, I noticed that the
> filesystem health report (as provided by xfs_spaceman) failed to report
> the file corruption even when spaceman was run immediately after running
> xfs_scrub to detect the corruption.  That isn't the intended behavior;
> one ought to be able to run scrub to detect errors in the ondisk
> metadata and be able to access those reports for some time after the
> scrub.
> 
> After running the same sequence through an instrumented kernel, I
> discovered the reason why -- scrub igets the file, scans it, marks it
> sick, and ireleases the inode.  When the VFS lets go of the incore
> inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> inode before it moves to RECLAIM state, iget reinitializes the VFS
> state, clears the sick and checked masks, and hands back the inode.  At
> this point, the caller has the exact same incore inode, but with all the
> health state erased.
> 
> In other words, we're erasing the incore inode's health state flags when
> we've decided NOT to sever the link between the incore inode and the
> ondisk inode.  This is wrong, so we need to remove the lines that zero
> the fields from xfs_iget_cache_hit.
> 
> As a precaution, we add the same lines into xfs_reclaim_inode just after
> we sever the link between incore and ondisk inode.  Strictly speaking
> this isn't necessary because once an inode has gone through reclaim it
> must go through xfs_inode_alloc (which also zeroes the state) and
> xfs_iget is careful to check for mismatches between the inode it pulls
> out of the radix tree and the one it wants.
> 
> Fixes: 6772c1f11206 ("xfs: track metadata health status")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

Though I do wonder how long such state will hang around, because
once the inode is IRECLAIMABLE and clean it is only a matter of
seconds before the background inode reclaimer will free it...

Cheers,

Dave.
Brian Foster June 3, 2021, 12:22 p.m. UTC | #2
On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> While running some fuzz tests on inode metadata, I noticed that the
> filesystem health report (as provided by xfs_spaceman) failed to report
> the file corruption even when spaceman was run immediately after running
> xfs_scrub to detect the corruption.  That isn't the intended behavior;
> one ought to be able to run scrub to detect errors in the ondisk
> metadata and be able to access those reports for some time after the
> scrub.
> 
> After running the same sequence through an instrumented kernel, I
> discovered the reason why -- scrub igets the file, scans it, marks it
> sick, and ireleases the inode.  When the VFS lets go of the incore
> inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> inode before it moves to RECLAIM state, iget reinitializes the VFS
> state, clears the sick and checked masks, and hands back the inode.  At
> this point, the caller has the exact same incore inode, but with all the
> health state erased.
> 
> In other words, we're erasing the incore inode's health state flags when
> we've decided NOT to sever the link between the incore inode and the
> ondisk inode.  This is wrong, so we need to remove the lines that zero
> the fields from xfs_iget_cache_hit.
> 
> As a precaution, we add the same lines into xfs_reclaim_inode just after
> we sever the link between incore and ondisk inode.  Strictly speaking
> this isn't necessary because once an inode has gone through reclaim it
> must go through xfs_inode_alloc (which also zeroes the state) and
> xfs_iget is careful to check for mismatches between the inode it pulls
> out of the radix tree and the one it wants.
> 
> Fixes: 6772c1f11206 ("xfs: track metadata health status")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

I think I reviewed this the last time around..

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_icache.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 396cc54ca03f..c3f912a9231b 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -523,9 +523,6 @@ xfs_iget_cache_hit(
>  				XFS_INO_TO_AGINO(pag->pag_mount, ino),
>  				XFS_ICI_RECLAIM_TAG);
>  		inode->i_state = I_NEW;
> -		ip->i_sick = 0;
> -		ip->i_checked = 0;
> -
>  		spin_unlock(&ip->i_flags_lock);
>  		spin_unlock(&pag->pag_ici_lock);
>  	} else {
> @@ -979,6 +976,8 @@ xfs_reclaim_inode(
>  	spin_lock(&ip->i_flags_lock);
>  	ip->i_flags = XFS_IRECLAIM;
>  	ip->i_ino = 0;
> +	ip->i_sick = 0;
> +	ip->i_checked = 0;
>  	spin_unlock(&ip->i_flags_lock);
>  
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>
Darrick J. Wong June 3, 2021, 8:41 p.m. UTC | #3
On Thu, Jun 03, 2021 at 02:21:07PM +1000, Dave Chinner wrote:
> On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > While running some fuzz tests on inode metadata, I noticed that the
> > filesystem health report (as provided by xfs_spaceman) failed to report
> > the file corruption even when spaceman was run immediately after running
> > xfs_scrub to detect the corruption.  That isn't the intended behavior;
> > one ought to be able to run scrub to detect errors in the ondisk
> > metadata and be able to access those reports for some time after the
> > scrub.
> > 
> > After running the same sequence through an instrumented kernel, I
> > discovered the reason why -- scrub igets the file, scans it, marks it
> > sick, and ireleases the inode.  When the VFS lets go of the incore
> > inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> > inode before it moves to RECLAIM state, iget reinitializes the VFS
> > state, clears the sick and checked masks, and hands back the inode.  At
> > this point, the caller has the exact same incore inode, but with all the
> > health state erased.
> > 
> > In other words, we're erasing the incore inode's health state flags when
> > we've decided NOT to sever the link between the incore inode and the
> > ondisk inode.  This is wrong, so we need to remove the lines that zero
> > the fields from xfs_iget_cache_hit.
> > 
> > As a precaution, we add the same lines into xfs_reclaim_inode just after
> > we sever the link between incore and ondisk inode.  Strictly speaking
> > this isn't necessary because once an inode has gone through reclaim it
> > must go through xfs_inode_alloc (which also zeroes the state) and
> > xfs_iget is careful to check for mismatches between the inode it pulls
> > out of the radix tree and the one it wants.
> > 
> > Fixes: 6772c1f11206 ("xfs: track metadata health status")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_icache.c |    5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> Looks fine.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> Though I do wonder how long such state will hang around, because
> once the inode is IRECLAIMABLE and clean it is only a matter of
> seconds before the background inode reclaimer will free it...

A future patchset will add the ability to make the per-AG health status
remember that we were forced to reclaim a sick inode:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting

This series used to be the "fixes first" part of that code, but since it
directly intersects with the deferred inactivation changes, I moved it
up to try to fix the problem sooner than later.

Anyway, thanks for the review.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong June 3, 2021, 8:41 p.m. UTC | #4
On Thu, Jun 03, 2021 at 08:22:10AM -0400, Brian Foster wrote:
> On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > While running some fuzz tests on inode metadata, I noticed that the
> > filesystem health report (as provided by xfs_spaceman) failed to report
> > the file corruption even when spaceman was run immediately after running
> > xfs_scrub to detect the corruption.  That isn't the intended behavior;
> > one ought to be able to run scrub to detect errors in the ondisk
> > metadata and be able to access those reports for some time after the
> > scrub.
> > 
> > After running the same sequence through an instrumented kernel, I
> > discovered the reason why -- scrub igets the file, scans it, marks it
> > sick, and ireleases the inode.  When the VFS lets go of the incore
> > inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> > inode before it moves to RECLAIM state, iget reinitializes the VFS
> > state, clears the sick and checked masks, and hands back the inode.  At
> > this point, the caller has the exact same incore inode, but with all the
> > health state erased.
> > 
> > In other words, we're erasing the incore inode's health state flags when
> > we've decided NOT to sever the link between the incore inode and the
> > ondisk inode.  This is wrong, so we need to remove the lines that zero
> > the fields from xfs_iget_cache_hit.
> > 
> > As a precaution, we add the same lines into xfs_reclaim_inode just after
> > we sever the link between incore and ondisk inode.  Strictly speaking
> > this isn't necessary because once an inode has gone through reclaim it
> > must go through xfs_inode_alloc (which also zeroes the state) and
> > xfs_iget is careful to check for mismatches between the inode it pulls
> > out of the radix tree and the one it wants.
> > 
> > Fixes: 6772c1f11206 ("xfs: track metadata health status")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> 
> I think I reviewed this the last time around..

Oops, yes, my bad. :(

--D

> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> >  fs/xfs/xfs_icache.c |    5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 396cc54ca03f..c3f912a9231b 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -523,9 +523,6 @@ xfs_iget_cache_hit(
> >  				XFS_INO_TO_AGINO(pag->pag_mount, ino),
> >  				XFS_ICI_RECLAIM_TAG);
> >  		inode->i_state = I_NEW;
> > -		ip->i_sick = 0;
> > -		ip->i_checked = 0;
> > -
> >  		spin_unlock(&ip->i_flags_lock);
> >  		spin_unlock(&pag->pag_ici_lock);
> >  	} else {
> > @@ -979,6 +976,8 @@ xfs_reclaim_inode(
> >  	spin_lock(&ip->i_flags_lock);
> >  	ip->i_flags = XFS_IRECLAIM;
> >  	ip->i_ino = 0;
> > +	ip->i_sick = 0;
> > +	ip->i_checked = 0;
> >  	spin_unlock(&ip->i_flags_lock);
> >  
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > 
>

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 396cc54ca03f..c3f912a9231b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -523,9 +523,6 @@ xfs_iget_cache_hit(
 				XFS_INO_TO_AGINO(pag->pag_mount, ino),
 				XFS_ICI_RECLAIM_TAG);
 		inode->i_state = I_NEW;
-		ip->i_sick = 0;
-		ip->i_checked = 0;
-
 		spin_unlock(&ip->i_flags_lock);
 		spin_unlock(&pag->pag_ici_lock);
 	} else {
@@ -979,6 +976,8 @@ xfs_reclaim_inode(
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
+	ip->i_sick = 0;
+	ip->i_checked = 0;
 	spin_unlock(&ip->i_flags_lock);
 
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);