diff mbox series

[1/6] xfs: use s_inodes in xfs_qm_dqrele_all_inodes

Message ID 161671807853.621936.16639622639548774275.stgit@magnolia (mailing list archive)
State New, archived
Headers show
Series xfs: clean up incore inode walk functions | expand

Commit Message

Darrick J. Wong March 26, 2021, 12:21 a.m. UTC
From: Christoph Hellwig <hch@lst.de>

Using xfs_inode_walk in xfs_qm_dqrele_all_inodes is complete overkill,
given that the function simply wants to iterate over all live inodes
known to the VFS.  Just iterate over the s_inodes list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c      |    2 +-
 fs/xfs/xfs_icache.h      |    2 ++
 fs/xfs/xfs_qm_syscalls.c |   56 +++++++++++++++++++++++++++++-----------------
 3 files changed, 38 insertions(+), 22 deletions(-)

Comments

Dave Chinner March 30, 2021, 12:44 a.m. UTC | #1
On Thu, Mar 25, 2021 at 05:21:18PM -0700, Darrick J. Wong wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Using xfs_inode_walk in xfs_qm_dqrele_all_inodes is complete overkill,
> given that the function simply wants to iterate over all live inodes
> known to the VFS.  Just iterate over the s_inodes list.

I'm not sure that assertion is true. We attach dquots during inode
inactivation after the VFS has removed the inode from the s_inodes
list and evicted the inode. Hence there is a window between the
inode being removed from the sb->s_inodes lists and it being marked
XFS_IRECLAIMABLE where we can attach dquots to the inode.

Indeed, an inode marked XFS_IRECLAIMABLE that has gone through
evict -> destroy -> inactive -> nlink != 0 -> xfs_free_eofblocks()
can have referenced dquots attached to it and require dqrele() to be
called to release them.

Hence I think that xfs_qm_dqrele_all_inodes() is broken if all it is
doing is walking vfs referenced inodes, because it doesn't actually
release the dquots attached to reclaimable inodes. If this did
actually release all dquots, then there wouldn't be a need for the
xfs_qm_dqdetach() call in xfs_reclaim_inode() just before it's
handed to RCU to be freed....

Cheers,

Dave.
Darrick J. Wong March 30, 2021, 2:36 a.m. UTC | #2
On Tue, Mar 30, 2021 at 11:44:07AM +1100, Dave Chinner wrote:
> On Thu, Mar 25, 2021 at 05:21:18PM -0700, Darrick J. Wong wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > Using xfs_inode_walk in xfs_qm_dqrele_all_inodes is complete overkill,
> > given that the function simply wants to iterate over all live inodes
> > known to the VFS.  Just iterate over the s_inodes list.
> 
> I'm not sure that assertion is true. We attach dquots during inode
> inactivation after the VFS has removed the inode from the s_inodes
> list and evicted the inode. Hence there is a window between the
> inode being removed from the sb->s_inodes lists and it being marked
> XFS_IRECLAIMABLE where we can attach dquots to the inode.
> 
> Indeed, an inode marked XFS_IRECLAIMABLE that has gone through
> evict -> destroy -> inactive -> nlink != 0 -> xfs_free_eofblocks()
> can have referenced dquots attached to it and require dqrele() to be
> called to release them.

Why do the dquots need to remain attached after destroy_inode?  We can
easily reattach them during inactivation (v3 did this), and I don't know
why an inode needs dquots once we're through making metadata updates.

> Hence I think that xfs_qm_dqrele_all_inodes() is broken if all it is
> doing is walking vfs referenced inodes, because it doesn't actually
> release the dquots attached to reclaimable inodes. If this did
> actually release all dquots, then there wouldn't be a need for the
> xfs_qm_dqdetach() call in xfs_reclaim_inode() just before it's
> handed to RCU to be freed....

Why does it work now, then?  The current code /also/ leaves the dquots
attached to reclaimable inodes, and the quotaoff scan ignores
IRECLAIMABLE inodes.  Has it simply been the case that the dqpurge spins
until reclaim runs, and reclaim gets run quickly enough (or quotaoff runs
infrequently enough) that nobody's complained?

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner March 30, 2021, 3:07 a.m. UTC | #3
On Mon, Mar 29, 2021 at 07:36:56PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 30, 2021 at 11:44:07AM +1100, Dave Chinner wrote:
> > On Thu, Mar 25, 2021 at 05:21:18PM -0700, Darrick J. Wong wrote:
> > > From: Christoph Hellwig <hch@lst.de>
> > > 
> > > Using xfs_inode_walk in xfs_qm_dqrele_all_inodes is complete overkill,
> > > given that the function simply wants to iterate over all live inodes
> > > known to the VFS.  Just iterate over the s_inodes list.
> > 
> > I'm not sure that assertion is true. We attach dquots during inode
> > inactivation after the VFS has removed the inode from the s_inodes
> > list and evicted the inode. Hence there is a window between the
> > inode being removed from the sb->s_inodes lists and it being marked
> > XFS_IRECLAIMABLE where we can attach dquots to the inode.
> > 
> > Indeed, an inode marked XFS_IRECLAIMABLE that has gone through
> > evict -> destroy -> inactive -> nlink != 0 -> xfs_free_eofblocks()
> > can have referenced dquots attached to it and require dqrele() to be
> > called to release them.
> 
> Why do the dquots need to remain attached after destroy_inode?

They don't. But that's not the problem here.

> We can
> easily reattach them during inactivation (v3 did this), and I don't know
> why an inode needs dquots once we're through making metadata updates.

Yes, they get re-attached for truncation, attr removal, EOF block
freeing, etc. Only on the unlinked inode path in inactivation do
they get removed once all the work that modifies the dquots is done.

But many of the paths don't detach them again because they are
multi-use.  e.g. xfs_free_eofblocks() will attach dquots, but doesn't
detach them because it's called from more places than just the
inactivation path.

I'm sure this can all be cleaned up, but I *really* don't like the
idea of a "walk all XFS inodes" scan that actually only walks the
inodes with VFS references and not -all XFS inodes-.

And there's other problems with doing sb->s_inodes list walks -
namely the global lock. While we are doing this walk (might be tens
of millions of inodes!) we can hold the s_inode_list_lock for a long
time and we cannot instantiate new inodes or evict inodes to/from
the cache while that lock is held. The XFS inode walk is lockless
and we don't hold off anything to do with cache instantiation and
freeing, so it has less impact on the running system.

Even if everything is clean and doesn't block on locks anywhere, the
s_inodes list walk needs a cond_resched() in it. Again, tens
(hundreds) of millions of inodes can be on that list, meaning it can
hold the CPU for a long time.

Next, igrab() takes a reference to the inode, which will mark it
referenced. This walk grabs every inode in the filesystem cache,
so it marks them all referenced and makes it harder to reclaim them
under memory pressure. This perturbs working set behaviour.

inode list walks and igrab/iput don't come for free - they perturb
the working set, LRU orders, cause lock contention, long tail
latencies, etc. The XFS inode cache walk might not be the prettiest
thing, but it doesn't have any of these nasty side effects.

So, in general, I don't think we should be adding new inode list
walks anywhere, not even deep in XFS where nobody else might care...

> > Hence I think that xfs_qm_dqrele_all_inodes() is broken if all it is
> > doing is walking vfs referenced inodes, because it doesn't actually
> > release the dquots attached to reclaimable inodes. If this did
> > actually release all dquots, then there wouldn't be a need for the
> > xfs_qm_dqdetach() call in xfs_reclaim_inode() just before it's
> > handed to RCU to be freed....
> 
> Why does it work now, then?  The current code /also/ leaves the dquots
> attached to reclaimable inodes, and the quotaoff scan ignores
> IRECLAIMABLE inodes.

Luck, I think.

> Has it simply been the case that the dqpurge spins
> until reclaim runs, and reclaim gets run quickly enough (or quotaoff runs
> infrequently enough) that nobody's complained?

Yup, that's my assumption - quotaoff is rare, inode reclaim runs
every 5s - and so we haven't noticed it because nobody has looked
closely at how dquots vs inode reclaim works recently...

Cheers,

Dave.
Darrick J. Wong March 30, 2021, 4:06 a.m. UTC | #4
On Tue, Mar 30, 2021 at 02:07:47PM +1100, Dave Chinner wrote:
> On Mon, Mar 29, 2021 at 07:36:56PM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 30, 2021 at 11:44:07AM +1100, Dave Chinner wrote:
> > > On Thu, Mar 25, 2021 at 05:21:18PM -0700, Darrick J. Wong wrote:
> > > > From: Christoph Hellwig <hch@lst.de>
> > > > 
> > > > Using xfs_inode_walk in xfs_qm_dqrele_all_inodes is complete overkill,
> > > > given that the function simply wants to iterate over all live inodes
> > > > known to the VFS.  Just iterate over the s_inodes list.
> > > 
> > > I'm not sure that assertion is true. We attach dquots during inode
> > > inactivation after the VFS has removed the inode from the s_inodes
> > > list and evicted the inode. Hence there is a window between the
> > > inode being removed from the sb->s_inodes lists and it being marked
> > > XFS_IRECLAIMABLE where we can attach dquots to the inode.
> > > 
> > > Indeed, an inode marked XFS_IRECLAIMABLE that has gone through
> > > evict -> destroy -> inactive -> nlink != 0 -> xfs_free_eofblocks()
> > > can have referenced dquots attached to it and require dqrele() to be
> > > called to release them.
> > 
> > Why do the dquots need to remain attached after destroy_inode?
> 
> They don't. But that's not the problem here.

Actually, they do need to remain attached nowadays, because COW blocks
are accounted as incore dquot reservations so we can't let the dquots
drop until the COW fork gets cleaned out.

Granted I guess I did have a patch that changed the dquot lifecycle so
that they would stay in memory after the refcount dropped to zero, even
if they had incore reservations.

...and now I finally see the plot twist that turns this into the
*Fourth* part of Yet Another Quota Restructuring.  This time I get to
reimplement quotaoff! :P

> > We can
> > easily reattach them during inactivation (v3 did this), and I don't know
> > why an inode needs dquots once we're through making metadata updates.
> 
> Yes, they get re-attached for truncation, attr removal, EOF block
> freeing, etc. Only on the unlinked inode path in inactivation do
> they get removed once all the work that modifies the dquots is done.
> 
> But many of the paths don't detach them again because they are
> multi-use.  e.g. xfs_free_eofblocks() will attach dquots, but doesn't
> detach them because it's called from more places than just the
> inactivation path.
> 
> I'm sure this can all be cleaned up, but I *really* don't like the
> idea of a "walk all XFS inodes" scan that actually only walks the
> inodes with VFS references and not -all XFS inodes-.
> 
> And there's other problems with doing sb->s_inodes list walks -
> namely the global lock. While we are doing this walk (might be tens
> of millions of inodes!) we can hold the s_inode_list_lock for a long
> time and we cannot instantiate new inodes or evict inodes to/from
> the cache while that lock is held. The XFS inode walk is lockless
> and we don't hold off anything to do with cache instantiation and
> freeing, so it has less impact on the running system.
> 
> Even if everything is clean and doesn't block on locks anywhere, the
> s_inodes list walk needs a cond_resched() in it. Again, tens
> (hundreds) of millions of inodes can be on that list, meaning it can
> hold the CPU for a long time.

Yeah, I had wondered how good an idea it was to replace batch lookups
with a list walk...

> Next, igrab() takes a reference to the inode, which will mark it
> referenced. This walk grabs every inode in the filesystem cache,
> so it marks them all referenced and makes it harder to reclaim them
> under memory pressure. This perturbs working set behaviour.
> 
> inode list walks and igrab/iput don't come for free - they perturb
> the working set, LRU orders, cause lock contention, long tail
> latencies, etc. The XFS inode cache walk might not be the prettiest
> thing, but it doesn't have any of these nasty side effects.
> 
> So, in general, I don't think we should be adding new inode list
> walks anywhere, not even deep in XFS where nobody else might care...

...but the current quotaoff behavior has /all/ of these problems too.

I think you and I hashed out on IRC that quotaoff could simply take the
ILOCK and the i_flags lock of every inode that isn't INEW, RECLAIMING,
or INACTIVATING; drop the dquots, and drop the locks, and then dqpurge
would only have to wait for the inodes that are actively being reclaimed
or inactivated.

I'll give that a try ... eventually, but I wouldn't be too confident
that I'll get all this turned around before I have to shut the door
next Thursday.

> > > Hence I think that xfs_qm_dqrele_all_inodes() is broken if all it is
> > > doing is walking vfs referenced inodes, because it doesn't actually
> > > release the dquots attached to reclaimable inodes. If this did
> > > actually release all dquots, then there wouldn't be a need for the
> > > xfs_qm_dqdetach() call in xfs_reclaim_inode() just before it's
> > > handed to RCU to be freed....
> > 
> > Why does it work now, then?  The current code /also/ leaves the dquots
> > attached to reclaimable inodes, and the quotaoff scan ignores
> > IRECLAIMABLE inodes.
> 
> Luck, I think.
> 
> > Has it simply been the case that the dqpurge spins
> > until reclaim runs, and reclaim gets run quickly enough (or quotaoff runs
> > infrequently enough) that nobody's complained?
> 
> Yup, that's my assumption - quotaoff is rare, inode reclaim runs
> every 5s - and so we haven't noticed it because nobody has looked
> closely at how dquots vs inode reclaim works recently...

Yes, that does explain some of the weird test duration quirks I see in
xfs/305....

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner March 31, 2021, 1:34 a.m. UTC | #5
On Mon, Mar 29, 2021 at 09:06:25PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 30, 2021 at 02:07:47PM +1100, Dave Chinner wrote:
> > Next, igrab() takes a reference to the inode, which will mark it
> > referenced. This walk grabs every inode in the filesystem cache,
> > so it marks them all referenced and makes it harder to reclaim them
> > under memory pressure. This perturbs working set behaviour.
> > 
> > inode list walks and igrab/iput don't come for free - they perturb
> > the working set, LRU orders, cause lock contention, long tail
> > latencies, etc. The XFS inode cache walk might not be the prettiest
> > thing, but it doesn't have any of these nasty side effects.
> > 
> > So, in general, I don't think we should be adding new inode list
> > walks anywhere, not even deep in XFS where nobody else might care...
> 
> ...but the current quotaoff behavior has /all/ of these problems too.

Yup, ISTR in the past it didn't even take inode references, but I
may be misremembering...

> I think you and I hashed out on IRC that quotaoff could simply take the
> ILOCK and the i_flags lock of every inode that isn't INEW, RECLAIMING,
> or INACTIVATING; drop the dquots, and drop the locks, and then dqpurge
> would only have to wait for the inodes that are actively being reclaimed
> or inactivated.

I'm pretty sure that we only need to avoid INEW and IRECLAIM.
Anything else that has dquots attached needs to be locked and have
them removed. Those that are marked INEW will not have dquots
attached, and those marked IRECLAIM are in the process of being torn
down so their dquots will be released in short order.

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 3c81daca0e9a..9a307bb738bd 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -235,7 +235,7 @@  xfs_inode_clear_reclaim_tag(
 	xfs_perag_clear_reclaim_tag(pag);
 }
 
-static void
+void
 xfs_inew_wait(
 	struct xfs_inode	*ip)
 {
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index d1fddb152420..d70d93c2bb10 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -78,4 +78,6 @@  int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
 void xfs_blockgc_stop(struct xfs_mount *mp);
 void xfs_blockgc_start(struct xfs_mount *mp);
 
+void xfs_inew_wait(struct xfs_inode *ip);
+
 #endif
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 11f1e2fbf22f..76efae956fa8 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -748,41 +748,27 @@  xfs_qm_scall_getquota_next(
 	return error;
 }
 
-STATIC int
+static void
 xfs_dqrele_inode(
 	struct xfs_inode	*ip,
-	void			*args)
+	uint			flags)
 {
-	uint			*flags = args;
-
-	/* skip quota inodes */
-	if (ip == ip->i_mount->m_quotainfo->qi_uquotaip ||
-	    ip == ip->i_mount->m_quotainfo->qi_gquotaip ||
-	    ip == ip->i_mount->m_quotainfo->qi_pquotaip) {
-		ASSERT(ip->i_udquot == NULL);
-		ASSERT(ip->i_gdquot == NULL);
-		ASSERT(ip->i_pdquot == NULL);
-		return 0;
-	}
-
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if ((*flags & XFS_UQUOTA_ACCT) && ip->i_udquot) {
+	if ((flags & XFS_UQUOTA_ACCT) && ip->i_udquot) {
 		xfs_qm_dqrele(ip->i_udquot);
 		ip->i_udquot = NULL;
 	}
-	if ((*flags & XFS_GQUOTA_ACCT) && ip->i_gdquot) {
+	if ((flags & XFS_GQUOTA_ACCT) && ip->i_gdquot) {
 		xfs_qm_dqrele(ip->i_gdquot);
 		ip->i_gdquot = NULL;
 	}
-	if ((*flags & XFS_PQUOTA_ACCT) && ip->i_pdquot) {
+	if ((flags & XFS_PQUOTA_ACCT) && ip->i_pdquot) {
 		xfs_qm_dqrele(ip->i_pdquot);
 		ip->i_pdquot = NULL;
 	}
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	return 0;
 }
 
-
 /*
  * Go thru all the inodes in the file system, releasing their dquots.
  *
@@ -794,7 +780,35 @@  xfs_qm_dqrele_all_inodes(
 	struct xfs_mount	*mp,
 	uint			flags)
 {
+	struct super_block	*sb = mp->m_super;
+	struct inode		*inode, *old_inode = NULL;
+
 	ASSERT(mp->m_quotainfo);
-	xfs_inode_walk(mp, XFS_INODE_WALK_INEW_WAIT, xfs_dqrele_inode,
-			&flags, XFS_ICI_NO_TAG);
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		struct xfs_inode	*ip = XFS_I(inode);
+
+		if (XFS_FORCED_SHUTDOWN(mp))
+			break;
+		if (xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
+			continue;
+		if (xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM))
+			continue;
+		if (!igrab(inode))
+			continue;
+		spin_unlock(&sb->s_inode_list_lock);
+
+		iput(old_inode);
+		old_inode = inode;
+
+		if (xfs_iflags_test(ip, XFS_INEW))
+			xfs_inew_wait(ip);
+		xfs_dqrele_inode(ip, flags);
+
+		spin_lock(&sb->s_inode_list_lock);
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+
+	iput(old_inode);
 }