From patchwork Thu Feb 17 17:25:15 2022
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 12750476
From: Brian Foster
To: linux-xfs@vger.kernel.org
Subject: [PATCH RFC 1/4] xfs: require an rcu grace period before inode recycle
Date: Thu, 17 Feb 2022 12:25:15 -0500
Message-Id: <20220217172518.3842951-2-bfoster@redhat.com>
In-Reply-To: <20220217172518.3842951-1-bfoster@redhat.com>
References: <20220217172518.3842951-1-bfoster@redhat.com>

The XFS inode allocation algorithm aggressively reuses recently freed inodes. This is historical behavior that has been in place for quite some time, since XFS was imported to mainline Linux. Once the VFS adopted RCU-walk path lookups (also some time ago), this behavior became slightly incompatible because the inode recycle path doesn't isolate concurrent access to the inode from the VFS. This has recently manifested as problems in the VFS when XFS happens to change the type or properties of a recently unlinked inode while it is still involved in an RCU lookup.
For example, if the VFS refers to a previous incarnation of a symlink inode, obtains the ->get_link() callback from inode_operations, and the latter happens to change to a non-symlink type via a recycle event, the ->get_link() callback pointer is reset to NULL and the lookup results in a crash.

To avoid this class of problem, isolate in-core inodes for recycling with an RCU grace period. This is the same level of protection the VFS expects for inactivated inodes that are never reused, and so guarantees no further concurrent access before the type or properties of the inode change.

We don't want an unconditional synchronize_rcu() call here because that would result in a significant performance impact to mixed inode allocation workloads. Fortunately, we can take advantage of the recently added deferred inactivation mechanism to mitigate the need for an RCU wait in some cases. Deferred inactivation queues and batches the on-disk freeing of recently destroyed inodes, and so for non-sustained inactivation workloads increases the likelihood that a grace period has elapsed by the time an inode is freed and observable by the allocation code as a recycle candidate.

We have to respect the lifecycle rules regardless of whether an inode was inactivated, because inode reinit modifies fields that concurrent RCU lookups may still access. Therefore, capture the current RCU grace period cookie as early as possible at destroy time and use it at lookup time to conditionally sync an RCU grace period if one hasn't elapsed before the recycle attempt. Slightly adjust struct xfs_inode to fit the new field into padding holes that conveniently preexist in the same cacheline as the deferred inactivation list.

Note that this patch alone introduces a significant negative impact to mixed file allocation and removal workloads because the allocation algorithm aggressively attempts to reuse recently freed inodes. This results in frequent RCU grace period synchronization stalls in the allocation path. This problem is mitigated by forthcoming patches to track and avoid recycling of inodes with pending RCU grace periods.

Signed-off-by: Brian Foster
---
 fs/xfs/xfs_icache.c | 37 ++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_inode.h  |  3 ++-
 fs/xfs/xfs_trace.h  |  8 ++++++--
 3 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 9644f938990c..693896bc690f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -351,6 +351,19 @@ xfs_iget_recycle(
         spin_unlock(&ip->i_flags_lock);
         rcu_read_unlock();
 
+        /*
+         * VFS RCU pathwalk lookups dictate the same lifecycle rules for an
+         * inode recycle as for freeing an inode. I.e., we cannot reinit the
+         * inode structure until a grace period has elapsed from the last
+         * ->destroy_inode() call. In most cases a grace period has already
+         * elapsed if the inode was inactivated, but synchronize here as a
+         * last resort to guarantee correctness.
+         */
+        if (!poll_state_synchronize_rcu(ip->i_destroy_gp)) {
+                cond_synchronize_rcu(ip->i_destroy_gp);
+                trace_xfs_iget_recycle_stall(ip);
+        }
+
         ASSERT(!rwsem_is_locked(&inode->i_rwsem));
         error = xfs_reinit_inode(mp, inode);
         if (error) {
@@ -1789,7 +1802,8 @@ xfs_check_delalloc(
 /* Schedule the inode for reclaim. */
 static void
 xfs_inodegc_set_reclaimable(
-        struct xfs_inode        *ip)
+        struct xfs_inode        *ip,
+        unsigned long           destroy_gp)
 {
         struct xfs_mount        *mp = ip->i_mount;
         struct xfs_perag        *pag;
@@ -1805,6 +1819,8 @@ xfs_inodegc_set_reclaimable(
         spin_lock(&ip->i_flags_lock);
         trace_xfs_inode_set_reclaimable(ip);
+        if (destroy_gp)
+                ip->i_destroy_gp = destroy_gp;
         ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
         ip->i_flags |= XFS_IRECLAIMABLE;
         xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
@@ -1826,7 +1842,8 @@ xfs_inodegc_inactivate(
 {
         trace_xfs_inode_inactivating(ip);
         xfs_inactive(ip);
-        xfs_inodegc_set_reclaimable(ip);
+        /* inactive inodes are assigned rcu state when first queued */
+        xfs_inodegc_set_reclaimable(ip, 0);
 }
 
 void
@@ -1997,7 +2014,8 @@ xfs_inodegc_want_flush_work(
  */
 static void
 xfs_inodegc_queue(
-        struct xfs_inode        *ip)
+        struct xfs_inode        *ip,
+        unsigned long           destroy_gp)
 {
         struct xfs_mount        *mp = ip->i_mount;
         struct xfs_inodegc      *gc;
@@ -2007,6 +2025,7 @@ xfs_inodegc_queue(
         trace_xfs_inode_set_need_inactive(ip);
         spin_lock(&ip->i_flags_lock);
         ip->i_flags |= XFS_NEED_INACTIVE;
+        ip->i_destroy_gp = destroy_gp;
         spin_unlock(&ip->i_flags_lock);
 
         gc = get_cpu_ptr(mp->m_inodegc);
@@ -2086,6 +2105,7 @@ xfs_inode_mark_reclaimable(
 {
         struct xfs_mount        *mp = ip->i_mount;
         bool                    need_inactive;
+        unsigned long           destroy_gp;
 
         XFS_STATS_INC(mp, vn_reclaim);
 
@@ -2094,15 +2114,22 @@ xfs_inode_mark_reclaimable(
          */
         ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_ALL_IRECLAIM_FLAGS));
 
+        /*
+         * Poll the RCU subsystem as early as possible and pass the grace
+         * period cookie along to assign to the inode. This grace period must
+         * expire before the struct inode can be recycled.
+         */
+        destroy_gp = start_poll_synchronize_rcu();
+
         need_inactive = xfs_inode_needs_inactive(ip);
         if (need_inactive) {
-                xfs_inodegc_queue(ip);
+                xfs_inodegc_queue(ip, destroy_gp);
                 return;
         }
 
         /* Going straight to reclaim, so drop the dquots. */
         xfs_qm_dqdetach(ip);
-        xfs_inodegc_set_reclaimable(ip);
+        xfs_inodegc_set_reclaimable(ip, destroy_gp);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b7e8f14d9fca..6ca60373ff58 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -40,8 +40,9 @@ typedef struct xfs_inode {
         /* Transaction and locking information. */
         struct xfs_inode_log_item *i_itemp;     /* logging information */
         mrlock_t                i_lock;         /* inode lock */
-        atomic_t                i_pincount;     /* inode pin count */
         struct llist_node       i_gclist;       /* deferred inactivation list */
+        unsigned long           i_destroy_gp;   /* destroy rcugp cookie */
+        atomic_t                i_pincount;     /* inode pin count */
 
         /*
          * Bitsets of inode metadata that have been checked and/or are sick.
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 4a8076ef8cb4..28ac861c3565 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -727,16 +727,19 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
                 __field(dev_t, dev)
                 __field(xfs_ino_t, ino)
                 __field(unsigned long, iflags)
+                __field(unsigned long, destroy_gp)
         ),
         TP_fast_assign(
                 __entry->dev = VFS_I(ip)->i_sb->s_dev;
                 __entry->ino = ip->i_ino;
                 __entry->iflags = ip->i_flags;
+                __entry->destroy_gp = ip->i_destroy_gp;
         ),
-        TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
+        TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx destroy_gp 0x%lx",
                   MAJOR(__entry->dev), MINOR(__entry->dev),
                   __entry->ino,
-                  __entry->iflags)
+                  __entry->iflags,
+                  __entry->destroy_gp)
 )
 
 #define DEFINE_INODE_EVENT(name) \
@@ -746,6 +749,7 @@ DEFINE_EVENT(xfs_inode_class, name, \
 DEFINE_INODE_EVENT(xfs_iget_skip);
 DEFINE_INODE_EVENT(xfs_iget_recycle);
 DEFINE_INODE_EVENT(xfs_iget_recycle_fail);
+DEFINE_INODE_EVENT(xfs_iget_recycle_stall);
 DEFINE_INODE_EVENT(xfs_iget_hit);
 DEFINE_INODE_EVENT(xfs_iget_miss);
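For reference, the grace period handling above is built entirely on the kernel's polled RCU grace period API. The following is a minimal sketch of that pattern outside of XFS; the object and field names (struct obj, destroy_gp, obj_destroy(), obj_recycle()) are made up for illustration, but start_poll_synchronize_rcu(), poll_state_synchronize_rcu() and cond_synchronize_rcu() are the stock interfaces the patch uses:

#include <linux/rcupdate.h>

/* Illustrative object; in the patch this cookie lives in struct xfs_inode. */
struct obj {
        unsigned long   destroy_gp;     /* grace period cookie from destroy time */
};

/* Called when the object is torn down but may still be seen by RCU readers. */
static void obj_destroy(struct obj *p)
{
        /* Snapshot the current grace period state; never blocks. */
        p->destroy_gp = start_poll_synchronize_rcu();
}

/* Called before the object's memory is reinitialized for reuse. */
static void obj_recycle(struct obj *p)
{
        /* Fast path: the grace period recorded at destroy time already expired. */
        if (poll_state_synchronize_rcu(p->destroy_gp))
                return;

        /* Slow path: block, but only for the remainder of that grace period. */
        cond_synchronize_rcu(p->destroy_gp);
}

The point of polling at destroy time is that the deferred inactivation queueing usually provides enough delay for the grace period to expire on its own, so the recycle path only blocks in the unlucky case.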
From patchwork Thu Feb 17 17:25:16 2022
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 12750473
From: Brian Foster
To: linux-xfs@vger.kernel.org
Subject: [PATCH RFC 2/4] xfs: tag reclaimable inodes with pending RCU grace periods as busy
Date: Thu, 17 Feb 2022 12:25:16 -0500
Message-Id: <20220217172518.3842951-3-bfoster@redhat.com>
In-Reply-To: <20220217172518.3842951-1-bfoster@redhat.com>
References: <20220217172518.3842951-1-bfoster@redhat.com>

In order to avoid aggressive recycling of in-core xfs_inode objects with pending grace periods, and the RCU sync stalls that recycling them entails, we must be able to identify such inodes quickly and reliably at allocation time. Claim a new tag for the in-core inode radix tree and tag all inodes that still have a pending grace period cookie as busy at the time they are made reclaimable.

Note that it is not necessary to maintain consistency between the tag and grace period status once the tag is set. The busy tag simply reflects that the grace period had not expired by the time the inode was set reclaimable, and therefore any reuse of the inode must first poll the RCU subsystem for subsequent expiration of the grace period. Clear the tag when the inode is recycled or reclaimed.

Signed-off-by: Brian Foster
---
 fs/xfs/xfs_icache.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 693896bc690f..245ee0f6670b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -32,6 +32,8 @@
 #define XFS_ICI_RECLAIM_TAG     0
 /* Inode has speculative preallocations (posteof or cow) to clean. */
 #define XFS_ICI_BLOCKGC_TAG     1
+/* inode has pending RCU grace period when set reclaimable */
+#define XFS_ICI_BUSY_TAG        2
 
 /*
  * The goal for walking incore inodes. These can correspond with incore inode
@@ -274,7 +276,7 @@ xfs_perag_clear_inode_tag(
         if (agino != NULLAGINO)
                 radix_tree_tag_clear(&pag->pag_ici_root, agino, tag);
         else
-                ASSERT(tag == XFS_ICI_RECLAIM_TAG);
+                ASSERT(tag == XFS_ICI_RECLAIM_TAG || tag == XFS_ICI_BUSY_TAG);
 
         if (tag == XFS_ICI_RECLAIM_TAG)
                 pag->pag_ici_reclaimable--;
@@ -336,6 +338,7 @@ xfs_iget_recycle(
 {
         struct xfs_mount        *mp = ip->i_mount;
         struct inode            *inode = VFS_I(ip);
+        xfs_agino_t             agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
         int                     error;
 
         trace_xfs_iget_recycle(ip);
@@ -392,8 +395,9 @@ xfs_iget_recycle(
          */
         ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
         ip->i_flags |= XFS_INEW;
-        xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-                        XFS_ICI_RECLAIM_TAG);
+
+        xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_BUSY_TAG);
+        xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
         inode->i_state = I_NEW;
         spin_unlock(&ip->i_flags_lock);
         spin_unlock(&pag->pag_ici_lock);
@@ -931,6 +935,7 @@ xfs_reclaim_inode(
         if (!radix_tree_delete(&pag->pag_ici_root,
                                 XFS_INO_TO_AGINO(ip->i_mount, ino)))
                 ASSERT(0);
+        xfs_perag_clear_inode_tag(pag, NULLAGINO, XFS_ICI_BUSY_TAG);
         xfs_perag_clear_inode_tag(pag, NULLAGINO, XFS_ICI_RECLAIM_TAG);
         spin_unlock(&pag->pag_ici_lock);
 
@@ -1807,6 +1812,7 @@ xfs_inodegc_set_reclaimable(
 {
         struct xfs_mount        *mp = ip->i_mount;
         struct xfs_perag        *pag;
+        xfs_agino_t             agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 
         if (!xfs_is_shutdown(mp) && ip->i_delayed_blks) {
                 xfs_check_delalloc(ip, XFS_DATA_FORK);
@@ -1821,10 +1827,12 @@ xfs_inodegc_set_reclaimable(
         trace_xfs_inode_set_reclaimable(ip);
         if (destroy_gp)
                 ip->i_destroy_gp = destroy_gp;
+        if (!poll_state_synchronize_rcu(ip->i_destroy_gp))
+                xfs_perag_set_inode_tag(pag, agino, XFS_ICI_BUSY_TAG);
+
         ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
         ip->i_flags |= XFS_IRECLAIMABLE;
-        xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-                        XFS_ICI_RECLAIM_TAG);
+        xfs_perag_set_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
         spin_unlock(&ip->i_flags_lock);
         spin_unlock(&pag->pag_ici_lock);
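For reviewers who want the tagging idea in isolation, the sketch below shows the same pattern on a bare radix tree: set a busy tag while the destroy-time grace period is still pending, test it cheaply at allocation time, and clear it on reuse. The tree, index and tag value here are placeholders (the patch uses pag->pag_ici_root and XFS_ICI_BUSY_TAG), and the locking XFS takes via pag_ici_lock around tag updates is elided:

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>

#define BUSY_TAG        2

static RADIX_TREE(busy_tree, GFP_ATOMIC);       /* stands in for pag->pag_ici_root */

/* Tag an entry busy if its destroy-time grace period has not yet expired. */
static void mark_reclaimable(unsigned long index, unsigned long destroy_gp)
{
        if (!poll_state_synchronize_rcu(destroy_gp))
                radix_tree_tag_set(&busy_tree, index, BUSY_TAG);
}

/* Allocation-time check: a tagged entry may still be under a grace period. */
static bool entry_is_busy(unsigned long index)
{
        return radix_tree_tag_get(&busy_tree, index, BUSY_TAG);
}

/* Clear the tag once the entry is recycled or reclaimed. */
static void clear_busy(unsigned long index)
{
        radix_tree_tag_clear(&busy_tree, index, BUSY_TAG);
}

As the commit log notes, a set tag is only a hint that the grace period had not expired when the entry went reclaimable; callers that actually reuse the entry still poll the RCU state recorded on it.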
From patchwork Thu Feb 17 17:25:17 2022
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 12750474
From: Brian Foster
To: linux-xfs@vger.kernel.org
Subject: [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
Date: Thu, 17 Feb 2022 12:25:17 -0500
Message-Id: <20220217172518.3842951-4-bfoster@redhat.com>
In-Reply-To: <20220217172518.3842951-1-bfoster@redhat.com>
References: <20220217172518.3842951-1-bfoster@redhat.com>

The free inode btree currently tracks all inode chunk records with at least one free inode. This simplifies the chunk and allocation selection algorithms, as free inode availability can be guaranteed after a few simple checks. This is no longer the case with busy inode avoidance, however, because busy inode state is tracked in the radix tree independently of physical allocation status. A busy inode avoidance algorithm relies on the ability to fall back to an inode chunk allocation one way or another in the event that all current free inodes are busy.
Hack in a crude allocation fallback mechanism for experimental purposes. If the inode selection algorithm is unable to locate a usable inode, allow it to return -EAGAIN to perform another physical chunk allocation in the AG and retry the inode allocation.

The current prototype can perform this allocation and retry sequence repeatedly, because a newly allocated chunk may still be covered by busy in-core inodes in the radix tree (if it was recently freed, for example). This is inefficient and temporary. It will be properly mitigated by background chunk removal, which defers freeing of an inode chunk's blocks from the point where the last used inode in the chunk is freed to a background task that frees chunks only once they are completely idle, thereby guaranteeing that a new chunk allocation always adds non-busy inodes to the AG.

Not-Signed-off-by: Brian Foster
---
 fs/xfs/libxfs/xfs_ialloc.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index b418fe0c0679..3eb41228e210 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -27,6 +27,7 @@
 #include "xfs_log.h"
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
+#include "xfs_trans_space.h"
 
 /*
  * Lookup a record by ino in the btree given by cur.
@@ -1689,6 +1690,7 @@ xfs_dialloc_try_ag(
                         goto out_release;
         }
 
+alloc:
         error = xfs_ialloc_ag_alloc(*tpp, agbp, pag);
         if (error < 0)
                 goto out_release;
@@ -1706,8 +1708,22 @@ xfs_dialloc_try_ag(
 
         /* Allocate an inode in the found AG */
         error = xfs_dialloc_ag(*tpp, agbp, pag, parent, &ino);
-        if (!error)
+        if (!error) {
                 *new_ino = ino;
+        } else if (error == -EAGAIN) {
+                /*
+                 * XXX: Temporary hack to retry allocs until background chunk
+                 * freeing is worked out.
+                 */
+                error = xfs_mod_fdblocks(pag->pag_mount,
+                        -((int64_t)XFS_IALLOC_SPACE_RES(pag->pag_mount)),
+                        false);
+                if (error)
+                        goto out_release;
+                (*tpp)->t_blk_res += XFS_IALLOC_SPACE_RES(pag->pag_mount);
+                goto alloc;
+        }
+
         return error;
 
 out_release:
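Since the interesting part of this patch is the control flow rather than the XFS plumbing, the sketch below restates the fallback loop with hypothetical helpers: select_free_inode() and alloc_inode_chunk() do not exist in XFS and stand in for xfs_dialloc_ag() and xfs_ialloc_ag_alloc(), and the transaction space re-reservation done in the real hunk is omitted:

#include <linux/errno.h>

struct ag;      /* opaque per-AG context, for the sketch only */

static int select_free_inode(struct ag *ag, unsigned long *ino);       /* hypothetical */
static int alloc_inode_chunk(struct ag *ag);                           /* hypothetical */

static int alloc_inode_with_retry(struct ag *ag, unsigned long *ino)
{
        int error;

        for (;;) {
                /* Try to pick a free, non-busy inode from the existing chunks. */
                error = select_free_inode(ag, ino);
                if (error != -EAGAIN)
                        return error;

                /*
                 * Every free inode was busy: allocate another physical chunk
                 * so the selection pool gains inodes with no pending grace
                 * period, then retry. Unbounded for now, as the commit log
                 * notes; background chunk removal would bound it.
                 */
                error = alloc_inode_chunk(ag);
                if (error)
                        return error;
        }
}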
From patchwork Thu Feb 17 17:25:18 2022
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 12750475
From: Brian Foster
To: linux-xfs@vger.kernel.org
Subject: [PATCH RFC 4/4] xfs: skip busy inodes on finobt inode allocation
Date: Thu, 17 Feb 2022 12:25:18 -0500
Message-Id: <20220217172518.3842951-5-bfoster@redhat.com>
In-Reply-To: <20220217172518.3842951-1-bfoster@redhat.com>
References: <20220217172518.3842951-1-bfoster@redhat.com>

Experimental algorithm to skip busy inodes on finobt-based inode allocation. This is a first-pass implementation intended primarily for functional and performance experimentation. The logic is intentionally kept simple to minimize the scope of changes required to demonstrate functionality.

The existing finobt inode allocation algorithms are updated to filter out records that are covered by at least one still-busy inode[1] and to continue scanning as appropriate based on the allocation mode. For example, near allocation mode continues scanning left and right until a usable record is found. newino mode checks the target record and then scans from the first record in the tree until a usable record is found. If the associated algorithm cannot find a usable record, it falls back to new chunk allocation to add non-busy inodes to the selection pool and restarts.

[1] As I write this, it occurs to me this logic could be further improved to compare the first busy inode against the first free inode in the record without disrupting the subsequent inode selection logic.
Not-Signed-off-by: Brian Foster
---
 fs/xfs/libxfs/xfs_ialloc.c | 81 +++++++++++++++++++++++++++++++++++---
 1 file changed, 76 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 3eb41228e210..c79c85327cf4 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1248,12 +1248,53 @@ xfs_dialloc_ag_inobt(
         return error;
 }
 
+#define XFS_LOOKUP_BATCH        XFS_INODES_PER_CHUNK
+#define XFS_ICI_BUSY_TAG        2
+STATIC bool
+xfs_dialloc_ag_inobt_rec_busy(
+        struct xfs_perag                *pag,
+        struct xfs_inobt_rec_incore     *rec)
+{
+        struct xfs_inode        *batch[XFS_LOOKUP_BATCH];
+        struct xfs_inode        *ip;
+        int                     found, i;
+        xfs_agino_t             startino = rec->ir_startino;
+        bool                    busy = false;
+        unsigned long           destroy_gp;
+
+        rcu_read_lock();
+        found = radix_tree_gang_lookup_tag(&pag->pag_ici_root, (void **) batch,
+                                           startino, XFS_LOOKUP_BATCH,
+                                           XFS_ICI_BUSY_TAG);
+        for (i = 0; i < found; i++) {
+                ip = batch[i];
+                spin_lock(&ip->i_flags_lock);
+                if (ip->i_ino >= startino + XFS_INODES_PER_CHUNK) {
+                        spin_unlock(&ip->i_flags_lock);
+                        break;
+                }
+                destroy_gp = ip->i_destroy_gp;
+                spin_unlock(&ip->i_flags_lock);
+
+                if (!poll_state_synchronize_rcu(destroy_gp)) {
+                        busy = true;
+                        break;
+                }
+        }
+        rcu_read_unlock();
+        trace_printk("%d: agno %d startino 0x%x found %d busy %d caller %pS\n",
+                     __LINE__, pag->pag_agno, startino, found, busy,
+                     (void *) _RET_IP_);
+        return busy;
+}
+
 /*
  * Use the free inode btree to allocate an inode based on distance from the
  * parent. Note that the provided cursor may be deleted and replaced.
  */
 STATIC int
 xfs_dialloc_ag_finobt_near(
+        struct xfs_perag        *pag,
         xfs_agino_t             pagino,
         struct xfs_btree_cur    **ocur,
         struct xfs_inobt_rec_incore     *rec)
@@ -1281,8 +1322,10 @@ xfs_dialloc_ag_finobt_near(
                  * existence is enough.
                  */
                 if (pagino >= rec->ir_startino &&
-                    pagino < (rec->ir_startino + XFS_INODES_PER_CHUNK))
-                        return 0;
+                    pagino < (rec->ir_startino + XFS_INODES_PER_CHUNK)) {
+                        if (!xfs_dialloc_ag_inobt_rec_busy(pag, rec))
+                                return 0;
+                }
         }
 
         error = xfs_btree_dup_cursor(lcur, &rcur);
@@ -1306,6 +1349,21 @@ xfs_dialloc_ag_finobt_near(
                         error = -EFSCORRUPTED;
                         goto error_rcur;
                 }
+
+        while (i == 1 && xfs_dialloc_ag_inobt_rec_busy(pag, rec)) {
+                error = xfs_ialloc_next_rec(lcur, rec, &i, 1);
+                if (error)
+                        goto error_rcur;
+                i = !i;
+        }
+
+        while (j == 1 && xfs_dialloc_ag_inobt_rec_busy(pag, &rrec)) {
+                error = xfs_ialloc_next_rec(rcur, &rrec, &j, 0);
+                if (error)
+                        goto error_rcur;
+                j = !j;
+        }
+
         if (i == 1 && j == 1) {
                 /*
                  * Both the left and right records are valid. Choose the closer
@@ -1327,6 +1385,9 @@ xfs_dialloc_ag_finobt_near(
         } else if (i == 1) {
                 /* only the left record is valid */
                 xfs_btree_del_cursor(rcur, XFS_BTREE_NOERROR);
+        } else {
+                error = -EAGAIN;
+                goto error_rcur;
         }
 
         return 0;
@@ -1342,6 +1403,7 @@
  */
 STATIC int
 xfs_dialloc_ag_finobt_newino(
+        struct xfs_perag        *pag,
         struct xfs_agi          *agi,
         struct xfs_btree_cur    *cur,
         struct xfs_inobt_rec_incore     *rec)
@@ -1360,7 +1422,8 @@ xfs_dialloc_ag_finobt_newino(
                         return error;
                 if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
                         return -EFSCORRUPTED;
-                return 0;
+                if (!xfs_dialloc_ag_inobt_rec_busy(pag, rec))
+                        return 0;
         }
 }
 
@@ -1379,6 +1442,14 @@ xfs_dialloc_ag_finobt_newino(
         if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
                 return -EFSCORRUPTED;
 
+        while (xfs_dialloc_ag_inobt_rec_busy(pag, rec)) {
+                error = xfs_ialloc_next_rec(cur, rec, &i, 0);
+                if (error)
+                        return error;
+                if (i == 1)
+                        return -EAGAIN;
+        }
+
         return 0;
 }
 
@@ -1470,9 +1541,9 @@ xfs_dialloc_ag(
          * not, consider the agi hint or find the first free inode in the AG.
          */
         if (pag->pag_agno == pagno)
-                error = xfs_dialloc_ag_finobt_near(pagino, &cur, &rec);
+                error = xfs_dialloc_ag_finobt_near(pag, pagino, &cur, &rec);
         else
-                error = xfs_dialloc_ag_finobt_newino(agi, cur, &rec);
+                error = xfs_dialloc_ag_finobt_newino(pag, agi, cur, &rec);
         if (error)
                 goto error_cur;
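To summarize the record selection policy in one place, the sketch below walks candidate records, skips any record that still maps to a busy in-core inode, and signals -EAGAIN so the caller can fall back to chunk allocation and retry. The helpers are hypothetical stand-ins for xfs_dialloc_ag_inobt_rec_busy() and xfs_ialloc_next_rec(), not real XFS functions:

#include <linux/errno.h>
#include <linux/types.h>

struct candidate_rec;   /* stand-in for struct xfs_inobt_rec_incore */

static bool rec_has_busy_inodes(struct candidate_rec *rec);             /* hypothetical */
static int next_candidate_rec(struct candidate_rec *rec, int *valid);   /* hypothetical */

static int pick_non_busy_record(struct candidate_rec *rec)
{
        int valid = 1;
        int error;

        while (valid) {
                /* A record is usable only if no inode in it is still busy. */
                if (!rec_has_busy_inodes(rec))
                        return 0;

                /* Otherwise advance to the next candidate, if any remain. */
                error = next_candidate_rec(rec, &valid);
                if (error)
                        return error;
        }

        /* Every candidate was busy: caller allocates a new chunk and retries. */
        return -EAGAIN;
}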