diff mbox series

[1/2] xfs: fix handling of RCU freed inodes from other AGs in xfs_icwalk_ag

Message ID 20240812052352.3786445-2-hch@lst.de (mailing list archive)
State New
Headers show
Series [1/2] xfs: fix handling of RCU freed inodes from other AGs in xfs_icwalk_ag | expand

Commit Message

Christoph Hellwig Aug. 12, 2024, 5:23 a.m. UTC
When xfs_icwalk_ag skips an inode because it was RCU freed from another
AG, the slot for the inode in the batch array needs to be zeroed.  We
also really shouldn't try to grab the inode in that case (or at very
least undo the grab), so move the call to xfs_icwalk_ag after this sanity
check.

Fixes: 1a3e8f3da09c ("xfs: convert inode cache lookups to use RCU locking")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_icache.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Comments

Darrick J. Wong Aug. 12, 2024, 5:39 p.m. UTC | #1
On Mon, Aug 12, 2024 at 07:23:00AM +0200, Christoph Hellwig wrote:
> When xfs_icwalk_ag skips an inode because it was RCU freed from another
> AG, the slot for the inode in the batch array needs to be zeroed.  We
> also really shouldn't try to grab the inode in that case (or at very
> least undo the grab), so move the call to xfs_icwalk_ag after this sanity
> check.
> 
> Fixes: 1a3e8f3da09c ("xfs: convert inode cache lookups to use RCU locking")
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_icache.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index ae3c049fd3a216..3ee92d3d1770db 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1701,9 +1701,6 @@ xfs_icwalk_ag(
>  		for (i = 0; i < nr_found; i++) {
>  			struct xfs_inode *ip = batch[i];
>  
> -			if (done || !xfs_icwalk_igrab(goal, ip, icw))
> -				batch[i] = NULL;
> -
>  			/*
>  			 * Update the index for the next lookup. Catch
>  			 * overflows into the next AG range which can occur if
> @@ -1716,8 +1713,14 @@ xfs_icwalk_ag(
>  			 * us to see this inode, so another lookup from the
>  			 * same index will not find it again.
>  			 */
> -			if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno)
> +			if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno) {
> +				batch[i] = NULL;
>  				continue;
> +			}
> +
> +			if (done || !xfs_icwalk_igrab(goal, ip, icw))
> +				batch[i] = NULL;

IOWs, if @ip has been freed and reallocated to a different AG, then we
don't want to touch it at all, not even to check IRECLAIMABLE in igrab.
I think that sounds correct so
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> +
>  			first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
>  			if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
>  				done = true;
> -- 
> 2.43.0
> 
>
Dave Chinner Aug. 14, 2024, 9:50 p.m. UTC | #2
On Mon, Aug 12, 2024 at 07:23:00AM +0200, Christoph Hellwig wrote:
> When xfs_icwalk_ag skips an inode because it was RCU freed from another
> AG, the slot for the inode in the batch array needs to be zeroed.  We
> also really shouldn't try to grab the inode in that case (or at very
> least undo the grab), so move the call to xfs_icwalk_ag after this sanity
> check.

I think this change is invalid and needs to be reworked. It moves
the actual "has the inode been RCU freed in this grace period"
checks to after some other check on the inode is performed. From
that perspective, this is a bad change to make.

From another perspective, this is a poor change to make because....

> Fixes: 1a3e8f3da09c ("xfs: convert inode cache lookups to use RCU locking")
> Signed-off-by: Christoph Hellwig <hch@lst.de>

... that commit and the current code isn't actually broken.

In commit 1a3e8f3da09c, xfs_inode_ag_walk_grab() added explicit
checks against using RCU freed inodes, and that nulls out the inode
in the batch array:

       /*
        * check for stale RCU freed inode
        *
        * If the inode has been reallocated, it doesn't matter if it's not in
        * the AG we are walking - we are walking for writeback, so if it
        * passes all the "valid inode" checks and is dirty, then we'll write
        * it back anyway.  If it has been reallocated and still being
        * initialised, the XFS_INEW check below will catch it.
        */
       spin_lock(&ip->i_flags_lock);
       if (!ip->i_ino)
               goto out_unlock_noent;


Indeed, xfs_icwalk_igrab() -> xfs_blockgc_igrab() still has this
check:

	/* Check for stale RCU freed inode */
        spin_lock(&ip->i_flags_lock);
        if (!ip->i_ino)
                goto out_unlock_noent;

However, that commit used a different check for reclaimable inodes;
it uses XFS_IRECLAIM to indicate that the inode is either being
reclaimed or has been RCU freed and so reclaim should not touch it.
xfs_reclaim_igrab() still retains that check:

	spin_lock(&ip->i_flags_lock);
        if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) ||
            __xfs_iflags_test(ip, XFS_IRECLAIM)) {
                /* not a reclaim candidate. */
                spin_unlock(&ip->i_flags_lock);
                return false;
        }

XFS_IRECLAIM is always set on RCU freed inodes (it's set at the same
time ip->i_ino is set to zero in xfs_reclaim_inode()) before the
inode is rcu freed, and so neither of these grab functions will
allow RCU freed inodes to be grabbed.

Hence:

> ---
>  fs/xfs/xfs_icache.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index ae3c049fd3a216..3ee92d3d1770db 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1701,9 +1701,6 @@ xfs_icwalk_ag(
>  		for (i = 0; i < nr_found; i++) {
>  			struct xfs_inode *ip = batch[i];
>  
> -			if (done || !xfs_icwalk_igrab(goal, ip, icw))
> -				batch[i] = NULL;
> -

The existing code here -always- catches RCU freed inodes that have
not yet been freed due to the grace period still running. This
code replaces RCU freed inodes with NULL in the batch
array, and so after this point the inode is guaranteed to be valid.

>  			/*
>  			 * Update the index for the next lookup. Catch
>  			 * overflows into the next AG range which can occur if
> @@ -1716,8 +1713,14 @@ xfs_icwalk_ag(
>  			 * us to see this inode, so another lookup from the
>  			 * same index will not find it again.
>  			 */
> -			if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno)
> +			if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno) {
> +				batch[i] = NULL;
>  				continue;

This check is not checking for RCU freed inodes - it was checking
for indoes that are both freed and reallocated within a single RCU
grace period.  I added this check in 1a3e8f3da09c (backin 2010)
because I wasn't 100% certain of the RCU existence guarantees and
those interacted with read side critical sections.  It was
defensive, "just in case" code.

Looking at this code now - with another 14 years of RCU algorithms
experience under my belt - it is obvious that this code is
unnecessary and always has been.

From RCU first principles, we cannot see a new inode
with a different inode number magically appear at this memory
address during a rcu_read_lock() side critical period.

Such behaviour would require the inode to be removed from the radix
tree, freed and then reallocated and reinitialised to a different
identity within a single RCU grace period. That is impossible
because the object we found during the lookup won't get freed until
the RCU grace period ends sometime after we drop the
rcu_read_lock()....

Hence the right thing to do here is to remove this check completely
and leave the rest of the code alone....

-Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ae3c049fd3a216..3ee92d3d1770db 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1701,9 +1701,6 @@  xfs_icwalk_ag(
 		for (i = 0; i < nr_found; i++) {
 			struct xfs_inode *ip = batch[i];
 
-			if (done || !xfs_icwalk_igrab(goal, ip, icw))
-				batch[i] = NULL;
-
 			/*
 			 * Update the index for the next lookup. Catch
 			 * overflows into the next AG range which can occur if
@@ -1716,8 +1713,14 @@  xfs_icwalk_ag(
 			 * us to see this inode, so another lookup from the
 			 * same index will not find it again.
 			 */
-			if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno)
+			if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno) {
+				batch[i] = NULL;
 				continue;
+			}
+
+			if (done || !xfs_icwalk_igrab(goal, ip, icw))
+				batch[i] = NULL;
+
 			first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
 			if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
 				done = true;