
xfs: only reset incore inode health state flags when reclaiming an inode

Message ID 20210320164007.GX22100@magnolia (mailing list archive)
State Superseded
Series xfs: only reset incore inode health state flags when reclaiming an inode

Commit Message

Darrick J. Wong March 20, 2021, 4:40 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

While running some fuzz tests on inode metadata, I noticed that the
filesystem health report (as provided by xfs_spaceman) failed to report
the file corruption even when spaceman was run immediately after running
xfs_scrub to detect the corruption.  That isn't the intended behavior;
one ought to be able to run scrub to detect errors in the ondisk
metadata and be able to access those reports for some time after the
scrub.

After running the same sequence through an instrumented kernel, I
discovered the reason why -- scrub igets the file, scans it, marks it
sick, and ireleases the inode.  When the VFS lets go of the incore
inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
inode before it moves to RECLAIM state, iget reinitializes the VFS
state, clears the sick and checked masks, and hands back the inode.  At
this point, the caller has the exact same incore inode, but with all the
health state erased.

In other words, we're erasing the incore inode's health state flags when
we've decided NOT to sever the link between the incore inode and the
ondisk inode.  This is wrong, so we need to remove the lines that zero
the fields from xfs_iget_cache_hit.

As a precaution, we add the same lines into xfs_reclaim_inode just after
we sever the link between incore and ondisk inode.  Strictly speaking
this isn't necessary because once an inode has gone through reclaim it
must go through xfs_inode_alloc (which also zeroes the state) and
xfs_iget is careful to check for mismatches between the inode it pulls
out of the radix tree and the one it wants.

Fixes: 6772c1f11206 ("xfs: track metadata health status")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Brian Foster March 22, 2021, 3:07 p.m. UTC | #1
On Sat, Mar 20, 2021 at 09:40:07AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> While running some fuzz tests on inode metadata, I noticed that the
> filesystem health report (as provided by xfs_spaceman) failed to report
> the file corruption even when spaceman was run immediately after running
> xfs_scrub to detect the corruption.  That isn't the intended behavior;
> one ought to be able to run scrub to detect errors in the ondisk
> metadata and be able to access those reports for some time after the
> scrub.
> 
> After running the same sequence through an instrumented kernel, I
> discovered the reason why -- scrub igets the file, scans it, marks it
> sick, and ireleases the inode.  When the VFS lets go of the incore
> inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> inode before it moves to RECLAIM state, iget reinitializes the VFS
> state, clears the sick and checked masks, and hands back the inode.  At
> this point, the caller has the exact same incore inode, but with all the
> health state erased.
> 

So this requires some degree of cache pressure to reproduce, right?
I.e., the inode likely does not immediately go to reclaimable state on
release, but rather the vfs eventually decides to evict and we
inactivate from there. If we grab the inode after that point, it
effectively behaves as if the inode structure was freed and we re-read
from disk because we cleared health state earlier than necessary.

If I'm following that correctly, do you observe a noticeable impact in
terms of health state lifetime? The change seems reasonable, I'm just
wondering how much longer we'd expect to have this information available
after vfs eviction occurs and if/how that impacts userspace scrub
behavior.

Brian

> In other words, we're erasing the incore inode's health state flags when
> we've decided NOT to sever the link between the incore inode and the
> ondisk inode.  This is wrong, so we need to remove the lines that zero
> the fields from xfs_iget_cache_hit.
> 
> As a precaution, we add the same lines into xfs_reclaim_inode just after
> we sever the link between incore and ondisk inode.  Strictly speaking
> this isn't necessary because once an inode has gone through reclaim it
> must go through xfs_inode_alloc (which also zeroes the state) and
> xfs_iget is careful to check for mismatches between the inode it pulls
> out of the radix tree and the one it wants.
> 
> Fixes: 6772c1f11206 ("xfs: track metadata health status")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 595bda69b18d..5325fa28d099 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -587,8 +587,6 @@ xfs_iget_cache_hit(
>  		ip->i_flags |= XFS_INEW;
>  		xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
>  		inode->i_state = I_NEW;
> -		ip->i_sick = 0;
> -		ip->i_checked = 0;
>  
>  		spin_unlock(&ip->i_flags_lock);
>  		spin_unlock(&pag->pag_ici_lock);
> @@ -1205,6 +1203,8 @@ xfs_reclaim_inode(
>  	spin_lock(&ip->i_flags_lock);
>  	ip->i_flags = XFS_IRECLAIM;
>  	ip->i_ino = 0;
> +	ip->i_sick = 0;
> +	ip->i_checked = 0;
>  	spin_unlock(&ip->i_flags_lock);
>  
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>
Darrick J. Wong March 22, 2021, 4:29 p.m. UTC | #2
On Mon, Mar 22, 2021 at 11:07:09AM -0400, Brian Foster wrote:
> On Sat, Mar 20, 2021 at 09:40:07AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > While running some fuzz tests on inode metadata, I noticed that the
> > filesystem health report (as provided by xfs_spaceman) failed to report
> > the file corruption even when spaceman was run immediately after running
> > xfs_scrub to detect the corruption.  That isn't the intended behavior;
> > one ought to be able to run scrub to detect errors in the ondisk
> > metadata and be able to access those reports for some time after the
> > scrub.
> > 
> > After running the same sequence through an instrumented kernel, I
> > discovered the reason why -- scrub igets the file, scans it, marks it
> > sick, and ireleases the inode.  When the VFS lets go of the incore
> > inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> > inode before it moves to RECLAIM state, iget reinitializes the VFS
> > state, clears the sick and checked masks, and hands back the inode.  At
> > this point, the caller has the exact same incore inode, but with all the
> > health state erased.
> > 
> 
> So this requires some degree of cache pressure to reproduce, right?

None at all, actually -- xfs_scrub uses bulkstat to find inodes to
scrub.  Bulkstat sets I_DONTCACHE on the files it reads, so the VFS
drops the inode (if there are no dentries or the dentries are also
marked DONTCACHE) as soon as irele drops the refcount to zero.
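
For reference, the VFS side looks roughly like this -- a paraphrased sketch of
fs/inode.c:iput_final(), not the verbatim kernel code, so treat the exact
conditions and locking as approximate:

static void iput_final(struct inode *inode)
{
	struct super_block	*sb = inode->i_sb;
	int			drop = generic_drop_inode(inode);
					/* or ->drop_inode() if the fs has one */

	/*
	 * A clean inode without the DONTCACHE hint normally parks on the LRU
	 * and stays incore until memory pressure pushes it out.
	 */
	if (!drop && !(inode->i_state & I_DONTCACHE) &&
	    (sb->s_flags & SB_ACTIVE)) {
		inode_add_lru(inode);
		return;
	}

	/*
	 * I_DONTCACHE (which bulkstat-loaded inodes carry) skips the LRU, so
	 * the inode is evicted -- and handed back to XFS, which tags it
	 * reclaimable -- as soon as the last reference goes away.
	 */
	evict(inode);
}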

IOWs, the fuzz test can do this on an idle system:

# xfs_db -x -c 'inode XXX' -c 'fuzz -d core.blah random' /dev/sda
# mount /dev/sda /mnt
# xfs_scrub -d -T -v -n /mnt/
Corruption: inode XXX record
# xfs_spaceman -c 'health -c' /mnt/

and spaceman doesn't report the corruption that we /just/ found.

> I.e., the inode likely does not immediately go to reclaimable state on
> release, but rather the vfs eventually decides to evict and we
> inactivate from there. If we grab the inode after that point, it
> effectively behaves as if the inode structure was freed and we re-read
> from disk because we cleared health state earlier than necessary.

Right.

> If I'm following that correctly, do you observe a noticeable impact in
> terms of health state lifetime? The change seems reasonable, I'm just
> wondering how much longer we'd expect to have this information available
> after vfs eviction occurs and if/how that impacts userspace scrub
> behavior.

I notice /some/ increase in health state lifetime.  With the patch
applied the shell code above works as intended (spaceman reports inode
XXX is broken) so long as inode reclaim doesn't run before spaceman.
Because these days reclaim is triggered by the AIL releasing the inodes,
this means that spaceman only has to be run before the next fsync or the
log force expiration.

Similarly, if a different user program open()s the inode after a scan,
then the health state will stick around as long as there hasn't been
enough memory pressure to push the inode out of memory for real.

My development tree also contains a patchset that adds the ability for
inode reclaim to notice that it's freeing an inode that had health
problems, and record in the per-AG health state that some inode
somewhere in that AG had a problem, but that's for a future submission.
This will close the hole where inode health state perishes rather quickly
under memory pressure.
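
The rough shape of that future change -- a sketch only, where XFS_SICK_AG_INODES
is a placeholder flag name and the exact hook point is an assumption -- would be
to latch the state into the per-AG health fields right where this patch zeroes
the per-inode fields in xfs_reclaim_inode():

	unsigned int	sick;

	spin_lock(&ip->i_flags_lock);
	sick = ip->i_sick;
	ip->i_flags = XFS_IRECLAIM;
	ip->i_ino = 0;
	ip->i_sick = 0;
	ip->i_checked = 0;
	spin_unlock(&ip->i_flags_lock);

	/*
	 * Remember that this AG held a sick inode before we forget which
	 * inode it was.
	 */
	if (sick)
		xfs_ag_mark_sick(pag, XFS_SICK_AG_INODES);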

--D

> Brian
> 
> > In other words, we're erasing the incore inode's health state flags when
> > we've decided NOT to sever the link between the incore inode and the
> > ondisk inode.  This is wrong, so we need to remove the lines that zero
> > the fields from xfs_iget_cache_hit.
> > 
> > As a precaution, we add the same lines into xfs_reclaim_inode just after
> > we sever the link between incore and ondisk inode.  Strictly speaking
> > this isn't necessary because once an inode has gone through reclaim it
> > must go through xfs_inode_alloc (which also zeroes the state) and
> > xfs_iget is careful to check for mismatches between the inode it pulls
> > out of the radix tree and the one it wants.
> > 
> > Fixes: 6772c1f11206 ("xfs: track metadata health status")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_icache.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 595bda69b18d..5325fa28d099 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -587,8 +587,6 @@ xfs_iget_cache_hit(
> >  		ip->i_flags |= XFS_INEW;
> >  		xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
> >  		inode->i_state = I_NEW;
> > -		ip->i_sick = 0;
> > -		ip->i_checked = 0;
> >  
> >  		spin_unlock(&ip->i_flags_lock);
> >  		spin_unlock(&pag->pag_ici_lock);
> > @@ -1205,6 +1203,8 @@ xfs_reclaim_inode(
> >  	spin_lock(&ip->i_flags_lock);
> >  	ip->i_flags = XFS_IRECLAIM;
> >  	ip->i_ino = 0;
> > +	ip->i_sick = 0;
> > +	ip->i_checked = 0;
> >  	spin_unlock(&ip->i_flags_lock);
> >  
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > 
>
Brian Foster March 22, 2021, 5:47 p.m. UTC | #3
On Mon, Mar 22, 2021 at 09:29:44AM -0700, Darrick J. Wong wrote:
> On Mon, Mar 22, 2021 at 11:07:09AM -0400, Brian Foster wrote:
> > On Sat, Mar 20, 2021 at 09:40:07AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > While running some fuzz tests on inode metadata, I noticed that the
> > > filesystem health report (as provided by xfs_spaceman) failed to report
> > > the file corruption even when spaceman was run immediately after running
> > > xfs_scrub to detect the corruption.  That isn't the intended behavior;
> > > one ought to be able to run scrub to detect errors in the ondisk
> > > metadata and be able to access those reports for some time after the
> > > scrub.
> > > 
> > > After running the same sequence through an instrumented kernel, I
> > > discovered the reason why -- scrub igets the file, scans it, marks it
> > > sick, and ireleases the inode.  When the VFS lets go of the incore
> > > inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> > > inode before it moves to RECLAIM state, iget reinitializes the VFS
> > > state, clears the sick and checked masks, and hands back the inode.  At
> > > this point, the caller has the exact same incore inode, but with all the
> > > health state erased.
> > > 
> > 
> > So this requires some degree of cache pressure to reproduce, right?
> 
> None at all, actually -- xfs_scrub uses bulkstat to find inodes to
> scrub.  Bulkstat sets I_DONTCACHE on the files it reads, so the VFS
> drops the inode (if there are no dentries or the dentries are also
> marked DONTCACHE) as soon as irele drops the refcount to zero.
> 

Ah, I see. So in the DONTCACHE case, the inode release translates to
immediate eviction and thus immediate reclaimable status.

> IOWs, the fuzz test can do this on an idle system:
> 
> # xfs_db -x -c 'inode XXX' -c 'fuzz -d core.blah random' /dev/sda
> # mount /dev/sda /mnt
> # xfs_scrub -d -T -v -n /mnt/
> Corruption: inode XXX record
> # xfs_spaceman -c 'health -c' /mnt/
> 
> and spaceman doesn't report the corruption that we /just/ found.
> 
> > I.e., the inode likely does not immediately go to reclaimable state on
> > release, but rather the vfs eventually decides to evict and we
> > inactivate from there. If we grab the inode after that point, it
> > effectively behaves as if the inode structure was freed and we re-read
> > from disk because we cleared health state earlier than necessary.
> 
> Right.
> 
> > If I'm following that correctly, do you observe a noticeable impact in
> > terms of health state lifetime? The change seems reasonable, I'm just
> > wondering how much longer we'd expect to have this information available
> > after vfs eviction occurs and if/how that impacts userspace scrub
> > behavior.
> 
> I notice /some/ increase in health state lifetime.  With the patch
> applied the shell code above works as intended (spaceman reports inode
> XXX is broken) so long as inode reclaim doesn't run before spaceman.
> Because these days reclaim is triggered by the AIL releasing the inodes,
> this means that spaceman only has to be run before the next fsync or the
> log force expiration.
> 

Not sure what you mean by the AIL releasing inodes associated with a scrub
scan... does scrub dirty unhealthy inodes?

Regardless, it seems like we're still potentially racing with an actual
reclaim and thus loss of health state is still technically possible,
just made noticeably less likely in certain cases by this patch (due to
I_DONTCACHE)...

> Similarly, if a different user program open()s the inode after a scan,
> then the health state will stick around as long as there hasn't been
> enough memory pressure to push the inode out of memory for real.
> 
> My development tree also contains a patchset that adds the ability for
> inode reclaim to notice that it's freeing an inode that had health
> problems, and record in the per-AG health state that some inode
> somewhere in that AG had a problem, but that's for a future submission.
> This will close the hole where inode health state perishes rather quickly
> under memory pressure.
> 

... and presumably eventually addressed by perag tracking.

That all seems reasonable to me. Thanks for the details:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> --D
> 
> > Brian
> > 
> > > In other words, we're erasing the incore inode's health state flags when
> > > we've decided NOT to sever the link between the incore inode and the
> > > ondisk inode.  This is wrong, so we need to remove the lines that zero
> > > the fields from xfs_iget_cache_hit.
> > > 
> > > As a precaution, we add the same lines into xfs_reclaim_inode just after
> > > we sever the link between incore and ondisk inode.  Strictly speaking
> > > this isn't necessary because once an inode has gone through reclaim it
> > > must go through xfs_inode_alloc (which also zeroes the state) and
> > > xfs_iget is careful to check for mismatches between the inode it pulls
> > > out of the radix tree and the one it wants.
> > > 
> > > Fixes: 6772c1f11206 ("xfs: track metadata health status")
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/xfs_icache.c |    4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 595bda69b18d..5325fa28d099 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -587,8 +587,6 @@ xfs_iget_cache_hit(
> > >  		ip->i_flags |= XFS_INEW;
> > >  		xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
> > >  		inode->i_state = I_NEW;
> > > -		ip->i_sick = 0;
> > > -		ip->i_checked = 0;
> > >  
> > >  		spin_unlock(&ip->i_flags_lock);
> > >  		spin_unlock(&pag->pag_ici_lock);
> > > @@ -1205,6 +1203,8 @@ xfs_reclaim_inode(
> > >  	spin_lock(&ip->i_flags_lock);
> > >  	ip->i_flags = XFS_IRECLAIM;
> > >  	ip->i_ino = 0;
> > > +	ip->i_sick = 0;
> > > +	ip->i_checked = 0;
> > >  	spin_unlock(&ip->i_flags_lock);
> > >  
> > >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > 
> > 
>
Dave Chinner March 22, 2021, 9:30 p.m. UTC | #4
On Sat, Mar 20, 2021 at 09:40:07AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> While running some fuzz tests on inode metadata, I noticed that the
> filesystem health report (as provided by xfs_spaceman) failed to report
> the file corruption even when spaceman was run immediately after running
> xfs_scrub to detect the corruption.  That isn't the intended behavior;
> one ought to be able to run scrub to detect errors in the ondisk
> metadata and be able to access those reports for some time after the
> scrub.
> 
> After running the same sequence through an instrumented kernel, I
> discovered the reason why -- scrub igets the file, scans it, marks it
> sick, and ireleases the inode.  When the VFS lets go of the incore
> inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> inode before it moves to RECLAIM state, iget reinitializes the VFS
> state, clears the sick and checked masks, and hands back the inode.  At
> this point, the caller has the exact same incore inode, but with all the
> health state erased.
> 
> In other words, we're erasing the incore inode's health state flags when
> we've decided NOT to sever the link between the incore inode and the
> ondisk inode.  This is wrong, so we need to remove the lines that zero
> the fields from xfs_iget_cache_hit.
> 
> As a precaution, we add the same lines into xfs_reclaim_inode just after
> we sever the link between incore and ondisk inode.  Strictly speaking
> this isn't necessary because once an inode has gone through reclaim it
> must go through xfs_inode_alloc (which also zeroes the state) and
> xfs_iget is careful to check for mismatches between the inode it pulls
> out of the radix tree and the one it wants.
> 
> Fixes: 6772c1f11206 ("xfs: track metadata health status")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 595bda69b18d..5325fa28d099 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -587,8 +587,6 @@ xfs_iget_cache_hit(
>  		ip->i_flags |= XFS_INEW;
>  		xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
>  		inode->i_state = I_NEW;
> -		ip->i_sick = 0;
> -		ip->i_checked = 0;
>  
>  		spin_unlock(&ip->i_flags_lock);
>  		spin_unlock(&pag->pag_ici_lock);
> @@ -1205,6 +1203,8 @@ xfs_reclaim_inode(
>  	spin_lock(&ip->i_flags_lock);
>  	ip->i_flags = XFS_IRECLAIM;
>  	ip->i_ino = 0;
> +	ip->i_sick = 0;
> +	ip->i_checked = 0;
>  	spin_unlock(&ip->i_flags_lock);
>  
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);

This is only going to keep the health information around on a
DONTCACHE inode for a few extra seconds. If the scrub takes longer
to run than it takes for the background inode reclaimer thread to
run again (every 5s by default), then the health information for
that inode is still trashed by this patch and the problem still
exists.

I suspect that unhealthy inodes need to have IDONTCACHE cleared so
that they don't get reclaimed until there is memory pressure, hence
giving scrub/spaceman some time to set/report health issues. Perhaps
we should not reclaim the unhealthy inodes until they've been marked
as "seen"....

Cheers,

Dave.
Darrick J. Wong March 22, 2021, 10:13 p.m. UTC | #5
On Tue, Mar 23, 2021 at 08:30:16AM +1100, Dave Chinner wrote:
> On Sat, Mar 20, 2021 at 09:40:07AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > While running some fuzz tests on inode metadata, I noticed that the
> > filesystem health report (as provided by xfs_spaceman) failed to report
> > the file corruption even when spaceman was run immediately after running
> > xfs_scrub to detect the corruption.  That isn't the intended behavior;
> > one ought to be able to run scrub to detect errors in the ondisk
> > metadata and be able to access those reports for some time after the
> > scrub.
> > 
> > After running the same sequence through an instrumented kernel, I
> > discovered the reason why -- scrub igets the file, scans it, marks it
> > sick, and ireleases the inode.  When the VFS lets go of the incore
> > inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> > inode before it moves to RECLAIM state, iget reinitializes the VFS
> > state, clears the sick and checked masks, and hands back the inode.  At
> > this point, the caller has the exact same incore inode, but with all the
> > health state erased.
> > 
> > In other words, we're erasing the incore inode's health state flags when
> > we've decided NOT to sever the link between the incore inode and the
> > ondisk inode.  This is wrong, so we need to remove the lines that zero
> > the fields from xfs_iget_cache_hit.
> > 
> > As a precaution, we add the same lines into xfs_reclaim_inode just after
> > we sever the link between incore and ondisk inode.  Strictly speaking
> > this isn't necessary because once an inode has gone through reclaim it
> > must go through xfs_inode_alloc (which also zeroes the state) and
> > xfs_iget is careful to check for mismatches between the inode it pulls
> > out of the radix tree and the one it wants.
> > 
> > Fixes: 6772c1f11206 ("xfs: track metadata health status")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_icache.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 595bda69b18d..5325fa28d099 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -587,8 +587,6 @@ xfs_iget_cache_hit(
> >  		ip->i_flags |= XFS_INEW;
> >  		xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
> >  		inode->i_state = I_NEW;
> > -		ip->i_sick = 0;
> > -		ip->i_checked = 0;
> >  
> >  		spin_unlock(&ip->i_flags_lock);
> >  		spin_unlock(&pag->pag_ici_lock);
> > @@ -1205,6 +1203,8 @@ xfs_reclaim_inode(
> >  	spin_lock(&ip->i_flags_lock);
> >  	ip->i_flags = XFS_IRECLAIM;
> >  	ip->i_ino = 0;
> > +	ip->i_sick = 0;
> > +	ip->i_checked = 0;
> >  	spin_unlock(&ip->i_flags_lock);
> >  
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> 
> This is only going to keep the health information around on a
> DONTCACHE inode for a few extra seconds. If the scrub takes longer
> to run than it takes for the background inode reclaimer thread to
> run again (every 5s by default), then the health information for
> that inode is still trashed by this patch and the problem still
> exists.
> 
> I suspect that unhealthy inodes need to have IDONTCACHE cleared so
> that they don't get reclaimed until there is memory pressure, hence
> giving scrub/spaceman some time to set/report health issues.

Yes, it seems reasonable to cancel DONTCACHE if you're marking an inode
sick, and for iget to ignore DONTCACHE if the inode is in memory and is
marked sick.  This also sounds like a second patch. :)
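
Roughly, the first half of that could look like the following -- a sketch only,
not part of this patch, with the placement and locking being assumptions --
extending xfs_inode_mark_sick():

void
xfs_inode_mark_sick(
	struct xfs_inode	*ip,
	unsigned int		mask)
{
	trace_xfs_inode_mark_sick(ip, mask);

	spin_lock(&ip->i_flags_lock);
	ip->i_sick |= mask;
	ip->i_checked |= mask;
	spin_unlock(&ip->i_flags_lock);

	/*
	 * Keep this unhealthy inode around past the final irele so that the
	 * sickness report survives long enough for someone to read it.
	 * Scrub's bulkstat-driven igets set DONTCACHE, which would otherwise
	 * send the inode straight to reclaim.
	 */
	spin_lock(&VFS_I(ip)->i_lock);
	VFS_I(ip)->i_state &= ~I_DONTCACHE;
	spin_unlock(&VFS_I(ip)->i_lock);
}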

> Perhaps we should not reclaim the unhealthy inodes until they've been
> marked as "seen"....

I'm hesitant to pin an inode in memory if it's unhealthy, because that
seems like it could lead to further problems if a large number of inodes
get marked sick and memory reclaim can't free enough RAM to enable a
recovery action (like shutting down the fs and unmounting it).

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner March 22, 2021, 11:10 p.m. UTC | #6
On Mon, Mar 22, 2021 at 03:13:54PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 23, 2021 at 08:30:16AM +1100, Dave Chinner wrote:
> > On Sat, Mar 20, 2021 at 09:40:07AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > While running some fuzz tests on inode metadata, I noticed that the
> > > filesystem health report (as provided by xfs_spaceman) failed to report
> > > the file corruption even when spaceman was run immediately after running
> > > xfs_scrub to detect the corruption.  That isn't the intended behavior;
> > > one ought to be able to run scrub to detect errors in the ondisk
> > > metadata and be able to access those reports for some time after the
> > > scrub.
> > > 
> > > After running the same sequence through an instrumented kernel, I
> > > discovered the reason why -- scrub igets the file, scans it, marks it
> > > sick, and ireleases the inode.  When the VFS lets go of the incore
> > > inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> > > inode before it moves to RECLAIM state, iget reinitializes the VFS
> > > state, clears the sick and checked masks, and hands back the inode.  At
> > > this point, the caller has the exact same incore inode, but with all the
> > > health state erased.
> > > 
> > > In other words, we're erasing the incore inode's health state flags when
> > > we've decided NOT to sever the link between the incore inode and the
> > > ondisk inode.  This is wrong, so we need to remove the lines that zero
> > > the fields from xfs_iget_cache_hit.
> > > 
> > > As a precaution, we add the same lines into xfs_reclaim_inode just after
> > > we sever the link between incore and ondisk inode.  Strictly speaking
> > > this isn't necessary because once an inode has gone through reclaim it
> > > must go through xfs_inode_alloc (which also zeroes the state) and
> > > xfs_iget is careful to check for mismatches between the inode it pulls
> > > out of the radix tree and the one it wants.
> > > 
> > > Fixes: 6772c1f11206 ("xfs: track metadata health status")
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/xfs_icache.c |    4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 595bda69b18d..5325fa28d099 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -587,8 +587,6 @@ xfs_iget_cache_hit(
> > >  		ip->i_flags |= XFS_INEW;
> > >  		xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
> > >  		inode->i_state = I_NEW;
> > > -		ip->i_sick = 0;
> > > -		ip->i_checked = 0;
> > >  
> > >  		spin_unlock(&ip->i_flags_lock);
> > >  		spin_unlock(&pag->pag_ici_lock);
> > > @@ -1205,6 +1203,8 @@ xfs_reclaim_inode(
> > >  	spin_lock(&ip->i_flags_lock);
> > >  	ip->i_flags = XFS_IRECLAIM;
> > >  	ip->i_ino = 0;
> > > +	ip->i_sick = 0;
> > > +	ip->i_checked = 0;
> > >  	spin_unlock(&ip->i_flags_lock);
> > >  
> > >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > 
> > This is only going to keep the health information around on a
> > DONTCACHE inode for a few extra seconds. If the scrub takes longer
> > to run than it takes for the background inode reclaimer thread to
> > run again (every 5s by default), then the health information for
> > that inode is still trashed by this patch and the problem still
> > exists.
> > 
> > I suspect that unhealthy inodes need to have IDONTCACHE cleared so
> > that they don't get reclaimed until there is memory pressure, hence
> > giving scrub/spaceman some time to set/report health issues.
> 
> Yes, it seems reasonable to cancel DONTCACHE if you're marking an inode
> sick, and for iget to ignore DONTCACHE if the inode is in memory and is
> marked sick.  This also sounds like a second patch. :)

Yup, sounds fine to me.

> > Perhaps we should not reclaim the unhealthy inodes until they've been
> > marked as "seen"....
> 
> I'm hesitant to pin an inode in memory if it's unhealthy, because that
> seems like it could lead to further problems if a large number of inodes
> get marked sick and memory reclaim can't free enough RAM to enable a
> recovery action (like shutting down the fs and unmounting it).

This is a solvable problem because we can tell the difference
between background inode reclaim and memory pressure triggered inode
reclaim. Firstly, memory pressure will release the last VFS
reference to the inode, so it's not seen by background reclaim until
memory reclaim has released it from the VFS. Secondly, we have
different entry points into inode reclaim from the shrinker vs
background reclaim so we can have different behaviour for the two if
we need to.

I suspect that having the VFS LRUs hold on to sick inodes until
memory pressure occurs is more than sufficient to avoid most
"reclaimed before reported" events. IF we are in a "lots of
corruption" or online repair scenario, then the admin already knows
that bad stuff is going down and verbosely reporting all the sick
inodes isn't really necessary...

Cheers,

Dave.

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 595bda69b18d..5325fa28d099 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -587,8 +587,6 @@  xfs_iget_cache_hit(
 		ip->i_flags |= XFS_INEW;
 		xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
 		inode->i_state = I_NEW;
-		ip->i_sick = 0;
-		ip->i_checked = 0;
 
 		spin_unlock(&ip->i_flags_lock);
 		spin_unlock(&pag->pag_ici_lock);
@@ -1205,6 +1203,8 @@  xfs_reclaim_inode(
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
+	ip->i_sick = 0;
+	ip->i_checked = 0;
 	spin_unlock(&ip->i_flags_lock);
 
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);