
[10/11] xfs: parallelize inode inactivation

Message ID 161543199635.1947934.2885924822578773349.stgit@magnolia
State Superseded, archived
Series xfs: deferred inode inactivation

Commit Message

Darrick J. Wong March 11, 2021, 3:06 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Split the inode inactivation work into per-AG work items so that we can
take advantage of parallelization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   62 ++++++++++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_mount.c  |    3 ++
 fs/xfs/xfs_mount.h  |    4 ++-
 fs/xfs/xfs_super.c  |    1 -
 4 files changed, 52 insertions(+), 18 deletions(-)

Comments

Christoph Hellwig March 15, 2021, 6:55 p.m. UTC | #1
On Wed, Mar 10, 2021 at 07:06:36PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Split the inode inactivation work into per-AG work items so that we can
> take advantage of parallelization.

Any reason this isn't just done from the beginning?  As-is, it just
seems to create a fair amount of churn.
Darrick J. Wong March 15, 2021, 7:03 p.m. UTC | #2
On Mon, Mar 15, 2021 at 06:55:51PM +0000, Christoph Hellwig wrote:
> On Wed, Mar 10, 2021 at 07:06:36PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Split the inode inactivation work into per-AG work items so that we can
> > take advantage of parallelization.
> 
> Any reason this isn't just done from the beginning?  As-is, it just
> seems to create a fair amount of churn.

I felt like the first patch was already too long at 1100 lines.

I don't mind combining them, but with the usual proviso that I don't
want the whole series to stall on reviewers going back and forth on this
point without anyone offering an RVB.

--D
Dave Chinner March 23, 2021, 10:21 p.m. UTC | #3
On Wed, Mar 10, 2021 at 07:06:36PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Split the inode inactivation work into per-AG work items so that we can
> take advantage of parallelization.

How does this scale out when we have thousands of AGs?

I'm guessing that the gc_workqueue has the default "unbound"
parallelism that means it will run up to 4 kworkers per CPU at a
time? Which means we could have hundreds of ags trying to hammer on
inactivations at the same time? And so bash hard on the log and
completely starve the syscall front end of log space?

It seems to me that this needs to bound the amount of concurrent
work to quite low numbers - even though it is per-ag, we do not want
this to swamp the system in kworkers blocked on log reservations
when such concurrency is not necessary.

Cheers,

Dave.
Darrick J. Wong March 24, 2021, 3:52 a.m. UTC | #4
On Wed, Mar 24, 2021 at 09:21:52AM +1100, Dave Chinner wrote:
> On Wed, Mar 10, 2021 at 07:06:36PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Split the inode inactivation work into per-AG work items so that we can
> > take advantage of parallelization.
> 
> How does this scale out when we have thousands of AGs?

Welllll... :)

> I'm guessing that the gc_workqueue has the default "unbound"
> parallelism that means it will run up to 4 kworkers per CPU at a
> time? Which means we could have hundreds of ags trying to hammer on
> inactivations at the same time? And so bash hard on the log and
> completely starve the syscall front end of log space?

Yep.  This is a blunt instrument to throttle the frontend when the
backend has too much queued.

> It seems to me that this needs to bound the amount of concurrent
> work to quite low numbers - even though it is per-ag, we do not want
> this to swamp the system in kworkers blocked on log reservations
> when such concurrency is not necessary.

Two months ago, I /did/ propose limiting the parallelism of those
unbound workqueues to an estimate of what the data device could
handle[1], and you said on IRC[2]:

[1] https://lore.kernel.org/linux-xfs/161040739544.1582286.11068012972712089066.stgit@magnolia/T/#ma0cd1bf1447ccfb66d615cab624c8df67d17f9b0

[2] (14:01:26) dchinner: "Assume parallelism is equal to number of disks"?

(14:02:22) dchinner: For spinning disks we want more parallelism than
that to hide seek latency - we want multiple IOs per disk so that the
elevator can re-order them and minimise seek distances across a set of
IOs

(14:02:37) dchinner: that can't be done if we are only issuing a single
IO per disk at a time

(14:03:30) djwong: 2 per spinning disk?

(14:03:32) dchinner: The more IO you can throw at spinning disks, the
lower the average seek penalty for any given IO....

(14:04:01) dchinner: And then there is hardware raid with caches and NVRAM....

(14:04:25) dchinner: This is why I find this sort of knob "misguided"

(14:05:01) dchinner: the "best value" is going to change according to
workload, storage stack config and hardware

(14:05:48) dchinner: Even for SSDs, a thread per CPU is not enough
parallelism if we are doing blocking IO in each thread

(14:07:07) dchinner: The device concurrency is actually the CTQ depth of
the underlying hardware, because that's how many IOs we can keep in
flight at once...

(14:08:06) dchinner: so, yeah, I'm not a fan of having knobs to "tune"
concurrency

(14:09:55) dchinner: As long as we have "enough" for decent performance
on a majority of setups, even if it is "too much" for some cases, that
is better than trying to find some magic optimal number for everyone....

(14:10:16) djwong: so do we simply let the workqueues spawn however many
threads and keep the bottleneck at the storage?

(14:10:39) djwong: (or log grant)

(14:11:08) dchinner: That's the idea - the concurrency backs up at the
serialisation point in the stack

(14:11:23) djwong: for blockgc and inactivation i don't think that's a
huge deal since we're probably going to run out of AGs or log space
anyway

(14:11:25) dchinner: that's actually valuable information if you are
doing performance evaluation

(14:11:51) dchinner: we know immediately where the concurrency
bottleneck is....

(14:13:15) dchinner: backing up in xfs-conv indicates that we're either
running out of log space, or the IO completion is contending on inode
locks with concurrent IO submission, etc

(14:14:13) dchinner: and if it's the xfs-buf kworkers that are going
crazy, we know it's metadata IO rather than user data IO that is having
problems....

(14:15:27) dchinner: seeing multiple active xfs-cil worker threads
indicates pipelined concurrent pushes being active, implying either the
CIL is filling faster than it can be pushed or there are lots of
fsync()s being issued

(14:16:57) dchinner: so, yeah, actually being able to see excessive
concurrency at the kworker level just from the process listing tells us
a lot from an analysis POV....

---

Now we have unrestricted unbound workqueues, and I'm definitely
getting to collect data on contention bottlenecks -- when there are a
lot of small files, AFAICT we mostly end up contending on the grant
heads, and when we have heavily fragmented images to kill off then it
tends to shift to the AG buffer locks.

So how do we estimate a reasonable upper bound on the number of workers?
Given that most of the gc workers will probably be contending on
AG[FI] buffer locks, I guess we could say min(agcount, nrcpus)?

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
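
For illustration, a minimal sketch of the min(agcount, nrcpus) bound floated
above, expressed as an explicit max_active cap on the gc workqueue rather than
the workqueue default.  This is not part of the patch below; the helper name,
the workqueue name, and the flag set are assumptions about how the mount-time
workqueue setup might look.

/*
 * Illustrative sketch only (not this patch): bound inodegc concurrency at
 * min(agcount, nrcpus) by passing an explicit max_active instead of 0 when
 * the gc workqueue is allocated at mount time.  Assumes the usual fs/xfs
 * headers and struct xfs_mount definitions are in scope.
 */
#include <linux/workqueue.h>
#include <linux/cpumask.h>

static int
xfs_alloc_inodegc_workqueue(
	struct xfs_mount	*mp)
{
	unsigned int		max_active;

	/* At most one worker per AG, and never more than one per online CPU. */
	max_active = min_t(unsigned int, mp->m_sb.sb_agcount,
			num_online_cpus());

	mp->m_gc_workqueue = alloc_workqueue("xfs-gc/%s",
			WQ_SYSFS | WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM,
			max_active, mp->m_super->s_id);
	if (!mp->m_gc_workqueue)
		return -ENOMEM;
	return 0;
}

Whether a static cap like this is the right call, or whether concurrency
should simply be left to back up at the log grant heads as argued in the IRC
log above, is the open question in this thread.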

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 594d340bbe37..d5f580b92e48 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -245,11 +245,13 @@  xfs_inode_clear_reclaim_tag(
 /* Queue a new inode gc pass if there are inodes needing inactivation. */
 static void
 xfs_inodegc_queue(
-	struct xfs_mount        *mp)
+	struct xfs_perag	*pag)
 {
+	struct xfs_mount	*mp = pag->pag_mount;
+
 	rcu_read_lock();
-	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INACTIVE_TAG))
-		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work,
+	if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_INACTIVE_TAG))
+		queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work,
 				msecs_to_jiffies(xfs_inodegc_centisecs * 10));
 	rcu_read_unlock();
 }
@@ -272,7 +274,7 @@  xfs_perag_set_inactive_tag(
 	spin_unlock(&mp->m_perag_lock);
 
 	/* schedule periodic background inode inactivation */
-	xfs_inodegc_queue(mp);
+	xfs_inodegc_queue(pag);
 
 	trace_xfs_perag_set_inactive(mp, pag->pag_agno, -1, _RET_IP_);
 }
@@ -2074,8 +2076,9 @@  void
 xfs_inodegc_worker(
 	struct work_struct	*work)
 {
-	struct xfs_mount	*mp = container_of(to_delayed_work(work),
-					struct xfs_mount, m_inodegc_work);
+	struct xfs_perag	*pag = container_of(to_delayed_work(work),
+					struct xfs_perag, pag_inodegc_work);
+	struct xfs_mount	*mp = pag->pag_mount;
 	int			error;
 
 	/*
@@ -2095,25 +2098,44 @@  xfs_inodegc_worker(
 		xfs_err(mp, "inode inactivation failed, error %d", error);
 
 	sb_end_write(mp->m_super);
-	xfs_inodegc_queue(mp);
+	xfs_inodegc_queue(pag);
 }
 
-/* Force all queued inode inactivation work to run immediately. */
-void
-xfs_inodegc_force(
-	struct xfs_mount	*mp)
+/* Garbage collect all inactive inodes in an AG immediately. */
+static inline bool
+xfs_inodegc_force_pag(
+	struct xfs_perag	*pag)
 {
+	struct xfs_mount	*mp = pag->pag_mount;
+
 	/*
 	 * In order to reset the delay timer to run immediately, we have to
 	 * cancel the work item and requeue it with a zero timer value.  We
 	 * don't care if the worker races with our requeue, because at worst
 	 * we iterate the radix tree and find no inodes to inactivate.
 	 */
-	if (!cancel_delayed_work(&mp->m_inodegc_work))
+	if (!cancel_delayed_work(&pag->pag_inodegc_work))
+		return false;
+
+	queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work, 0);
+	return true;
+}
+
+/* Force all queued inode inactivation work to run immediately. */
+void
+xfs_inodegc_force(
+	struct xfs_mount	*mp)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	bool			queued = false;
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG)
+		queued |= xfs_inodegc_force_pag(pag);
+	if (!queued)
 		return;
 
-	queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
-	flush_delayed_work(&mp->m_inodegc_work);
+	flush_workqueue(mp->m_gc_workqueue);
 }
 
 /* Stop all queued inactivation work. */
@@ -2121,7 +2143,11 @@  void
 xfs_inodegc_stop(
 	struct xfs_mount	*mp)
 {
-	cancel_delayed_work_sync(&mp->m_inodegc_work);
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG)
+		cancel_delayed_work_sync(&pag->pag_inodegc_work);
 }
 
 /* Schedule deferred inode inactivation work. */
@@ -2129,5 +2155,9 @@  void
 xfs_inodegc_start(
 	struct xfs_mount	*mp)
 {
-	xfs_inodegc_queue(mp);
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
+	for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG)
+		xfs_inodegc_queue(pag);
 }
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index cd015e3d72fc..a5963061485c 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -127,6 +127,7 @@  __xfs_free_perag(
 	struct xfs_perag *pag = container_of(head, struct xfs_perag, rcu_head);
 
 	ASSERT(!delayed_work_pending(&pag->pag_blockgc_work));
+	ASSERT(!delayed_work_pending(&pag->pag_inodegc_work));
 	ASSERT(atomic_read(&pag->pag_ref) == 0);
 	kmem_free(pag);
 }
@@ -148,6 +149,7 @@  xfs_free_perag(
 		ASSERT(pag);
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
+		cancel_delayed_work_sync(&pag->pag_inodegc_work);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
@@ -204,6 +206,7 @@  xfs_initialize_perag(
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
 		INIT_DELAYED_WORK(&pag->pag_blockgc_work, xfs_blockgc_worker);
+		INIT_DELAYED_WORK(&pag->pag_inodegc_work, xfs_inodegc_worker);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 
 		error = xfs_buf_hash_init(pag);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index ce00ad47b8ea..835c07d00cd7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -177,7 +177,6 @@  typedef struct xfs_mount {
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
-	struct delayed_work	m_inodegc_work; /* background inode inactive */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -370,6 +369,9 @@  typedef struct xfs_perag {
 	/* background prealloc block trimming */
 	struct delayed_work	pag_blockgc_work;
 
+	/* background inode inactivation */
+	struct delayed_work	pag_inodegc_work;
+
 	/* reference count */
 	uint8_t			pagf_refcount_level;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8d0142487fc7..566e5657c1b0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1879,7 +1879,6 @@  static int xfs_init_fs_context(
 	mutex_init(&mp->m_growlock);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
-	INIT_DELAYED_WORK(&mp->m_inodegc_work, xfs_inodegc_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
 	/*
 	 * We don't create the finobt per-ag space reservation until after log