From patchwork Thu Feb 17 17:25:15 2022
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 12750476
From: Brian Foster
To: linux-xfs@vger.kernel.org
Subject: [PATCH RFC 1/4] xfs: require an rcu grace period before inode recycle
Date: Thu, 17 Feb 2022 12:25:15 -0500
Message-Id: <20220217172518.3842951-2-bfoster@redhat.com>
In-Reply-To: <20220217172518.3842951-1-bfoster@redhat.com>
References: <20220217172518.3842951-1-bfoster@redhat.com>

The XFS inode allocation algorithm aggressively reuses recently freed inodes. This is historical behavior that has been in place for quite some time, since XFS was imported to mainline Linux. Once the VFS adopted RCU-walk path lookups (also some time ago), this behavior became slightly incompatible because the inode recycle path doesn't isolate concurrent access to the inode from the VFS. This has recently manifested as problems in the VFS when XFS happens to change the type or properties of a recently unlinked inode while it is still involved in an RCU lookup.
For example, if the VFS refers to a previous incarnation of a symlink inode, obtains the ->get_link() callback from inode_operations, and the latter happens to change to a non-symlink type via a recycle event, the ->get_link() callback pointer is reset to NULL and the lookup results in a crash.

To avoid this class of problem, isolate in-core inodes for recycling with an RCU grace period. This is the same level of protection the VFS expects for inactivated inodes that are never reused, and so guarantees no further concurrent access before the type or properties of the inode change.

We don't want an unconditional synchronize_rcu() call here because that would result in a significant performance impact to mixed inode allocation workloads. Fortunately, we can take advantage of the recently added deferred inactivation mechanism to mitigate the need for an RCU wait in some cases. Deferred inactivation queues and batches the on-disk freeing of recently destroyed inodes, and so for non-sustained inactivation workloads increases the likelihood that a grace period has elapsed by the time an inode is freed and observable by the allocation code as a recycle candidate.

We have to respect the lifecycle rules regardless of whether an inode was inactivated, because inode reinit modifies fields that concurrent RCU lookups may still access. Therefore, capture the current RCU grace period cookie as early as possible at destroy time and use it at lookup time to conditionally sync an RCU grace period if one hasn't elapsed before the recycle attempt. Slightly adjust struct xfs_inode to fit the new field into padding holes that conveniently preexist in the same cacheline as the deferred inactivation list.

Note that this patch alone introduces a significant negative impact to mixed file allocation and removal workloads because the allocation algorithm aggressively attempts to reuse recently freed inodes. This results in frequent RCU grace period synchronization stalls in the allocation path. This problem is mitigated by forthcoming patches to track and avoid recycling of inodes with pending RCU grace periods.

Signed-off-by: Brian Foster
---
 fs/xfs/xfs_icache.c | 37 ++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_inode.h  |  3 ++-
 fs/xfs/xfs_trace.h  |  8 ++++++--
 3 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 9644f938990c..693896bc690f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -351,6 +351,19 @@ xfs_iget_recycle(
         spin_unlock(&ip->i_flags_lock);
         rcu_read_unlock();
 
+        /*
+         * VFS RCU pathwalk lookups dictate the same lifecycle rules for an
+         * inode recycle as for freeing an inode. I.e., we cannot reinit the
+         * inode structure until a grace period has elapsed from the last
+         * ->destroy_inode() call. In most cases a grace period has already
+         * elapsed if the inode was inactivated, but synchronize here as a
+         * last resort to guarantee correctness.
+         */
+        if (!poll_state_synchronize_rcu(ip->i_destroy_gp)) {
+                cond_synchronize_rcu(ip->i_destroy_gp);
+                trace_xfs_iget_recycle_stall(ip);
+        }
+
         ASSERT(!rwsem_is_locked(&inode->i_rwsem));
         error = xfs_reinit_inode(mp, inode);
         if (error) {
@@ -1789,7 +1802,8 @@ xfs_check_delalloc(
 /* Schedule the inode for reclaim. */
 static void
 xfs_inodegc_set_reclaimable(
-        struct xfs_inode        *ip)
+        struct xfs_inode        *ip,
+        unsigned long           destroy_gp)
 {
         struct xfs_mount        *mp = ip->i_mount;
         struct xfs_perag        *pag;
@@ -1805,6 +1819,8 @@ xfs_inodegc_set_reclaimable(
         spin_lock(&ip->i_flags_lock);
         trace_xfs_inode_set_reclaimable(ip);
+        if (destroy_gp)
+                ip->i_destroy_gp = destroy_gp;
         ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
         ip->i_flags |= XFS_IRECLAIMABLE;
         xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
@@ -1826,7 +1842,8 @@ xfs_inodegc_inactivate(
 {
         trace_xfs_inode_inactivating(ip);
         xfs_inactive(ip);
-        xfs_inodegc_set_reclaimable(ip);
+        /* inactive inodes are assigned rcu state when first queued */
+        xfs_inodegc_set_reclaimable(ip, 0);
 }
 
 void
@@ -1997,7 +2014,8 @@ xfs_inodegc_want_flush_work(
  */
 static void
 xfs_inodegc_queue(
-        struct xfs_inode        *ip)
+        struct xfs_inode        *ip,
+        unsigned long           destroy_gp)
 {
         struct xfs_mount        *mp = ip->i_mount;
         struct xfs_inodegc      *gc;
@@ -2007,6 +2025,7 @@ xfs_inodegc_queue(
         trace_xfs_inode_set_need_inactive(ip);
         spin_lock(&ip->i_flags_lock);
         ip->i_flags |= XFS_NEED_INACTIVE;
+        ip->i_destroy_gp = destroy_gp;
         spin_unlock(&ip->i_flags_lock);
 
         gc = get_cpu_ptr(mp->m_inodegc);
@@ -2086,6 +2105,7 @@ xfs_inode_mark_reclaimable(
 {
         struct xfs_mount        *mp = ip->i_mount;
         bool                    need_inactive;
+        unsigned long           destroy_gp;
 
         XFS_STATS_INC(mp, vn_reclaim);
 
@@ -2094,15 +2114,22 @@ xfs_inode_mark_reclaimable(
          */
         ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_ALL_IRECLAIM_FLAGS));
 
+        /*
+         * Poll the RCU subsystem as early as possible and pass the grace
+         * period cookie along to assign to the inode. This grace period must
+         * expire before the struct inode can be recycled.
+         */
+        destroy_gp = start_poll_synchronize_rcu();
+
         need_inactive = xfs_inode_needs_inactive(ip);
         if (need_inactive) {
-                xfs_inodegc_queue(ip);
+                xfs_inodegc_queue(ip, destroy_gp);
                 return;
         }
 
         /* Going straight to reclaim, so drop the dquots. */
         xfs_qm_dqdetach(ip);
-        xfs_inodegc_set_reclaimable(ip);
+        xfs_inodegc_set_reclaimable(ip, destroy_gp);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b7e8f14d9fca..6ca60373ff58 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -40,8 +40,9 @@ typedef struct xfs_inode {
         /* Transaction and locking information. */
         struct xfs_inode_log_item *i_itemp;     /* logging information */
         mrlock_t                i_lock;         /* inode lock */
-        atomic_t                i_pincount;     /* inode pin count */
         struct llist_node       i_gclist;       /* deferred inactivation list */
+        unsigned long           i_destroy_gp;   /* destroy rcugp cookie */
+        atomic_t                i_pincount;     /* inode pin count */
 
         /*
          * Bitsets of inode metadata that have been checked and/or are sick.
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 4a8076ef8cb4..28ac861c3565 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -727,16 +727,19 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
                 __field(dev_t, dev)
                 __field(xfs_ino_t, ino)
                 __field(unsigned long, iflags)
+                __field(unsigned long, destroy_gp)
         ),
         TP_fast_assign(
                 __entry->dev = VFS_I(ip)->i_sb->s_dev;
                 __entry->ino = ip->i_ino;
                 __entry->iflags = ip->i_flags;
+                __entry->destroy_gp = ip->i_destroy_gp;
         ),
-        TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
+        TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx destroy_gp 0x%lx",
                   MAJOR(__entry->dev), MINOR(__entry->dev),
                   __entry->ino,
-                  __entry->iflags)
+                  __entry->iflags,
+                  __entry->destroy_gp)
 )
 
 #define DEFINE_INODE_EVENT(name) \
@@ -746,6 +749,7 @@ DEFINE_EVENT(xfs_inode_class, name, \
 DEFINE_INODE_EVENT(xfs_iget_skip);
 DEFINE_INODE_EVENT(xfs_iget_recycle);
 DEFINE_INODE_EVENT(xfs_iget_recycle_fail);
+DEFINE_INODE_EVENT(xfs_iget_recycle_stall);
 DEFINE_INODE_EVENT(xfs_iget_hit);
 DEFINE_INODE_EVENT(xfs_iget_miss);
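For reference, the grace period handling above is built entirely on the kernel's polled RCU grace period API. The following is a minimal sketch of that pattern outside of XFS; the object and field names (struct obj, destroy_gp, obj_destroy(), obj_recycle()) are made up for illustration, but start_poll_synchronize_rcu(), poll_state_synchronize_rcu() and cond_synchronize_rcu() are the stock interfaces the patch uses:

#include <linux/rcupdate.h>

/* Illustrative object; in the patch this cookie lives in struct xfs_inode. */
struct obj {
        unsigned long   destroy_gp;     /* grace period cookie from destroy time */
};

/* Called when the object is torn down but may still be seen by RCU readers. */
static void obj_destroy(struct obj *p)
{
        /* Snapshot the current grace period state; never blocks. */
        p->destroy_gp = start_poll_synchronize_rcu();
}

/* Called before the object's memory is reinitialized for reuse. */
static void obj_recycle(struct obj *p)
{
        /* Fast path: the grace period recorded at destroy time already expired. */
        if (poll_state_synchronize_rcu(p->destroy_gp))
                return;

        /* Slow path: block, but only for the remainder of that grace period. */
        cond_synchronize_rcu(p->destroy_gp);
}

The point of polling at destroy time is that the deferred inactivation queueing usually provides enough delay for the grace period to expire on its own, so the recycle path only blocks in the unlucky case.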
From patchwork Thu Feb 17 17:25:16 2022
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 12750473
From: Brian Foster
To: linux-xfs@vger.kernel.org
Subject: [PATCH RFC 2/4] xfs: tag reclaimable inodes with pending RCU grace periods as busy
Date: Thu, 17 Feb 2022 12:25:16 -0500
Message-Id: <20220217172518.3842951-3-bfoster@redhat.com>
In-Reply-To: <20220217172518.3842951-1-bfoster@redhat.com>
References: <20220217172518.3842951-1-bfoster@redhat.com>

In order to avoid aggressive recycling of in-core xfs_inode objects with pending grace periods, and the RCU sync stalls that recycling them entails, we must be able to identify such inodes quickly and reliably at allocation time. Claim a new tag for the in-core inode radix tree and tag all inodes that still have a pending grace period cookie as busy at the time they are made reclaimable.

Note that it is not necessary to maintain consistency between the tag and grace period status once the tag is set. The busy tag simply reflects that the grace period had not expired by the time the inode was set reclaimable, and therefore any reuse of the inode must first poll the RCU subsystem for subsequent expiration of the grace period. Clear the tag when the inode is recycled or reclaimed.

Signed-off-by: Brian Foster
---
 fs/xfs/xfs_icache.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 693896bc690f..245ee0f6670b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -32,6 +32,8 @@
 #define XFS_ICI_RECLAIM_TAG     0
 /* Inode has speculative preallocations (posteof or cow) to clean. */
 #define XFS_ICI_BLOCKGC_TAG     1
+/* inode has pending RCU grace period when set reclaimable */
+#define XFS_ICI_BUSY_TAG        2
 
 /*
  * The goal for walking incore inodes. These can correspond with incore inode
@@ -274,7 +276,7 @@ xfs_perag_clear_inode_tag(
         if (agino != NULLAGINO)
                 radix_tree_tag_clear(&pag->pag_ici_root, agino, tag);
         else
-                ASSERT(tag == XFS_ICI_RECLAIM_TAG);
+                ASSERT(tag == XFS_ICI_RECLAIM_TAG || tag == XFS_ICI_BUSY_TAG);
 
         if (tag == XFS_ICI_RECLAIM_TAG)
                 pag->pag_ici_reclaimable--;
@@ -336,6 +338,7 @@ xfs_iget_recycle(
 {
         struct xfs_mount        *mp = ip->i_mount;
         struct inode            *inode = VFS_I(ip);
+        xfs_agino_t             agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
         int                     error;
 
         trace_xfs_iget_recycle(ip);
@@ -392,8 +395,9 @@ xfs_iget_recycle(
          */
         ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
         ip->i_flags |= XFS_INEW;
-        xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-                        XFS_ICI_RECLAIM_TAG);
+
+        xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_BUSY_TAG);
+        xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
         inode->i_state = I_NEW;
         spin_unlock(&ip->i_flags_lock);
         spin_unlock(&pag->pag_ici_lock);
@@ -931,6 +935,7 @@ xfs_reclaim_inode(
         if (!radix_tree_delete(&pag->pag_ici_root,
                                 XFS_INO_TO_AGINO(ip->i_mount, ino)))
                 ASSERT(0);
+        xfs_perag_clear_inode_tag(pag, NULLAGINO, XFS_ICI_BUSY_TAG);
         xfs_perag_clear_inode_tag(pag, NULLAGINO, XFS_ICI_RECLAIM_TAG);
         spin_unlock(&pag->pag_ici_lock);
 
@@ -1807,6 +1812,7 @@ xfs_inodegc_set_reclaimable(
 {
         struct xfs_mount        *mp = ip->i_mount;
         struct xfs_perag        *pag;
+        xfs_agino_t             agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 
         if (!xfs_is_shutdown(mp) && ip->i_delayed_blks) {
                 xfs_check_delalloc(ip, XFS_DATA_FORK);
@@ -1821,10 +1827,12 @@ xfs_inodegc_set_reclaimable(
         trace_xfs_inode_set_reclaimable(ip);
         if (destroy_gp)
                 ip->i_destroy_gp = destroy_gp;
+        if (!poll_state_synchronize_rcu(ip->i_destroy_gp))
+                xfs_perag_set_inode_tag(pag, agino, XFS_ICI_BUSY_TAG);
+
         ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
         ip->i_flags |= XFS_IRECLAIMABLE;
-        xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-                        XFS_ICI_RECLAIM_TAG);
+        xfs_perag_set_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
         spin_unlock(&ip->i_flags_lock);
         spin_unlock(&pag->pag_ici_lock);
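For reviewers who want the tagging idea in isolation, the sketch below shows the same pattern on a bare radix tree: set a busy tag while the destroy-time grace period is still pending, test it cheaply at allocation time, and clear it on reuse. The tree, index and tag value here are placeholders (the patch uses pag->pag_ici_root and XFS_ICI_BUSY_TAG), and the locking XFS takes via pag_ici_lock around tag updates is elided:

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>

#define BUSY_TAG        2

static RADIX_TREE(busy_tree, GFP_ATOMIC);       /* stands in for pag->pag_ici_root */

/* Tag an entry busy if its destroy-time grace period has not yet expired. */
static void mark_reclaimable(unsigned long index, unsigned long destroy_gp)
{
        if (!poll_state_synchronize_rcu(destroy_gp))
                radix_tree_tag_set(&busy_tree, index, BUSY_TAG);
}

/* Allocation-time check: a tagged entry may still be under a grace period. */
static bool entry_is_busy(unsigned long index)
{
        return radix_tree_tag_get(&busy_tree, index, BUSY_TAG);
}

/* Clear the tag once the entry is recycled or reclaimed. */
static void clear_busy(unsigned long index)
{
        radix_tree_tag_clear(&busy_tree, index, BUSY_TAG);
}

As the commit log notes, a set tag is only a hint that the grace period had not expired when the entry went reclaimable; callers that actually reuse the entry still poll the RCU state recorded on it.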
From patchwork Thu Feb 17 17:25:17 2022
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 12750474
From: Brian Foster
To: linux-xfs@vger.kernel.org
Subject: [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
Date: Thu, 17 Feb 2022 12:25:17 -0500
Message-Id: <20220217172518.3842951-4-bfoster@redhat.com>
In-Reply-To: <20220217172518.3842951-1-bfoster@redhat.com>
References: <20220217172518.3842951-1-bfoster@redhat.com>

The free inode btree currently tracks all inode chunk records with at least one free inode. This simplifies the chunk and allocation selection algorithms, as free inode availability can be guaranteed after a few simple checks. This is no longer the case with busy inode avoidance, however, because busy inode state is tracked in the radix tree independently of physical allocation status. A busy inode avoidance algorithm relies on the ability to fall back to an inode chunk allocation one way or another in the event that all current free inodes are busy.
Hack in a crude allocation fallback mechanism for experimental purposes. If the inode selection algorithm is unable to locate a usable inode, allow it to return -EAGAIN to perform another physical chunk allocation in the AG and retry the inode allocation.

The current prototype can perform this allocation and retry sequence repeatedly, because a newly allocated chunk may still be covered by busy in-core inodes in the radix tree (if it was recently freed, for example). This is inefficient and temporary. It will be properly mitigated by background chunk removal, which defers freeing of an inode chunk's blocks from the point where the last used inode in the chunk is freed to a background task that frees chunks only once they are completely idle, thereby guaranteeing that a new chunk allocation always adds non-busy inodes to the AG.

Not-Signed-off-by: Brian Foster
---
 fs/xfs/libxfs/xfs_ialloc.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index b418fe0c0679..3eb41228e210 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -27,6 +27,7 @@
 #include "xfs_log.h"
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
+#include "xfs_trans_space.h"
 
 /*
  * Lookup a record by ino in the btree given by cur.
@@ -1689,6 +1690,7 @@ xfs_dialloc_try_ag(
                         goto out_release;
         }
 
+alloc:
         error = xfs_ialloc_ag_alloc(*tpp, agbp, pag);
         if (error < 0)
                 goto out_release;
@@ -1706,8 +1708,22 @@ xfs_dialloc_try_ag(
 
         /* Allocate an inode in the found AG */
         error = xfs_dialloc_ag(*tpp, agbp, pag, parent, &ino);
-        if (!error)
+        if (!error) {
                 *new_ino = ino;
+        } else if (error == -EAGAIN) {
+                /*
+                 * XXX: Temporary hack to retry allocs until background chunk
+                 * freeing is worked out.
+                 */
+                error = xfs_mod_fdblocks(pag->pag_mount,
+                        -((int64_t)XFS_IALLOC_SPACE_RES(pag->pag_mount)),
+                        false);
+                if (error)
+                        goto out_release;
+                (*tpp)->t_blk_res += XFS_IALLOC_SPACE_RES(pag->pag_mount);
+                goto alloc;
+        }
+
         return error;
 
 out_release:
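Since the interesting part of this patch is the control flow rather than the XFS plumbing, the sketch below restates the fallback loop with hypothetical helpers: select_free_inode() and alloc_inode_chunk() do not exist in XFS and stand in for xfs_dialloc_ag() and xfs_ialloc_ag_alloc(), and the transaction space re-reservation done in the real hunk is omitted:

#include <linux/errno.h>

struct ag;      /* opaque per-AG context, for the sketch only */

static int select_free_inode(struct ag *ag, unsigned long *ino);       /* hypothetical */
static int alloc_inode_chunk(struct ag *ag);                           /* hypothetical */

static int alloc_inode_with_retry(struct ag *ag, unsigned long *ino)
{
        int error;

        for (;;) {
                /* Try to pick a free, non-busy inode from the existing chunks. */
                error = select_free_inode(ag, ino);
                if (error != -EAGAIN)
                        return error;

                /*
                 * Every free inode was busy: allocate another physical chunk
                 * so the selection pool gains inodes with no pending grace
                 * period, then retry. Unbounded for now, as the commit log
                 * notes; background chunk removal would bound it.
                 */
                error = alloc_inode_chunk(ag);
                if (error)
                        return error;
        }
}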
From patchwork Thu Feb 17 17:25:18 2022
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 12750475
From: Brian Foster
To: linux-xfs@vger.kernel.org
Subject: [PATCH RFC 4/4] xfs: skip busy inodes on finobt inode allocation
Date: Thu, 17 Feb 2022 12:25:18 -0500
Message-Id: <20220217172518.3842951-5-bfoster@redhat.com>
In-Reply-To: <20220217172518.3842951-1-bfoster@redhat.com>
References: <20220217172518.3842951-1-bfoster@redhat.com>

Experimental algorithm to skip busy inodes on finobt-based inode allocation. This is a first-pass implementation intended primarily for functional and performance experimentation. The logic is intentionally kept simple to minimize the scope of changes required to demonstrate functionality.

The existing finobt inode allocation algorithms are updated to filter out records that are covered by at least one still-busy inode[1] and to continue scanning as appropriate based on the allocation mode. For example, near allocation mode continues scanning left and right until a usable record is found. newino mode checks the target record and then scans from the first record in the tree until a usable record is found. If the associated algorithm cannot find a usable record, it falls back to new chunk allocation to add non-busy inodes to the selection pool and restarts.

[1] As I write this, it occurs to me this logic could be further improved to compare the first busy inode against the first free inode in the record without disrupting the subsequent inode selection logic.
Not-Signed-off-by: Brian Foster
---
 fs/xfs/libxfs/xfs_ialloc.c | 81 +++++++++++++++++++++++++++++++++++---
 1 file changed, 76 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 3eb41228e210..c79c85327cf4 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1248,12 +1248,53 @@ xfs_dialloc_ag_inobt(
         return error;
 }
 
+#define XFS_LOOKUP_BATCH        XFS_INODES_PER_CHUNK
+#define XFS_ICI_BUSY_TAG        2
+STATIC bool
+xfs_dialloc_ag_inobt_rec_busy(
+        struct xfs_perag                *pag,
+        struct xfs_inobt_rec_incore     *rec)
+{
+        struct xfs_inode        *batch[XFS_LOOKUP_BATCH];
+        struct xfs_inode        *ip;
+        int                     found, i;
+        xfs_agino_t             startino = rec->ir_startino;
+        bool                    busy = false;
+        unsigned long           destroy_gp;
+
+        rcu_read_lock();
+        found = radix_tree_gang_lookup_tag(&pag->pag_ici_root, (void **) batch,
+                                           startino, XFS_LOOKUP_BATCH,
+                                           XFS_ICI_BUSY_TAG);
+        for (i = 0; i < found; i++) {
+                ip = batch[i];
+                spin_lock(&ip->i_flags_lock);
+                if (ip->i_ino >= startino + XFS_INODES_PER_CHUNK) {
+                        spin_unlock(&ip->i_flags_lock);
+                        break;
+                }
+                destroy_gp = ip->i_destroy_gp;
+                spin_unlock(&ip->i_flags_lock);
+
+                if (!poll_state_synchronize_rcu(destroy_gp)) {
+                        busy = true;
+                        break;
+                }
+        }
+        rcu_read_unlock();
+        trace_printk("%d: agno %d startino 0x%x found %d busy %d caller %pS\n",
+                     __LINE__, pag->pag_agno, startino, found, busy,
+                     (void *) _RET_IP_);
+        return busy;
+}
+
 /*
  * Use the free inode btree to allocate an inode based on distance from the
  * parent. Note that the provided cursor may be deleted and replaced.
  */
 STATIC int
 xfs_dialloc_ag_finobt_near(
+        struct xfs_perag        *pag,
         xfs_agino_t             pagino,
         struct xfs_btree_cur    **ocur,
         struct xfs_inobt_rec_incore     *rec)
@@ -1281,8 +1322,10 @@ xfs_dialloc_ag_finobt_near(
                  * existence is enough.
                  */
                 if (pagino >= rec->ir_startino &&
-                    pagino < (rec->ir_startino + XFS_INODES_PER_CHUNK))
-                        return 0;
+                    pagino < (rec->ir_startino + XFS_INODES_PER_CHUNK)) {
+                        if (!xfs_dialloc_ag_inobt_rec_busy(pag, rec))
+                                return 0;
+                }
         }
 
         error = xfs_btree_dup_cursor(lcur, &rcur);
@@ -1306,6 +1349,21 @@ xfs_dialloc_ag_finobt_near(
                         error = -EFSCORRUPTED;
                         goto error_rcur;
                 }
+
+        while (i == 1 && xfs_dialloc_ag_inobt_rec_busy(pag, rec)) {
+                error = xfs_ialloc_next_rec(lcur, rec, &i, 1);
+                if (error)
+                        goto error_rcur;
+                i = !i;
+        }
+
+        while (j == 1 && xfs_dialloc_ag_inobt_rec_busy(pag, &rrec)) {
+                error = xfs_ialloc_next_rec(rcur, &rrec, &j, 0);
+                if (error)
+                        goto error_rcur;
+                j = !j;
+        }
+
         if (i == 1 && j == 1) {
                 /*
                  * Both the left and right records are valid. Choose the closer
@@ -1327,6 +1385,9 @@ xfs_dialloc_ag_finobt_near(
         } else if (i == 1) {
                 /* only the left record is valid */
                 xfs_btree_del_cursor(rcur, XFS_BTREE_NOERROR);
+        } else {
+                error = -EAGAIN;
+                goto error_rcur;
         }
 
         return 0;
@@ -1342,6 +1403,7 @@
  */
 STATIC int
 xfs_dialloc_ag_finobt_newino(
+        struct xfs_perag        *pag,
         struct xfs_agi          *agi,
         struct xfs_btree_cur    *cur,
         struct xfs_inobt_rec_incore     *rec)
@@ -1360,7 +1422,8 @@ xfs_dialloc_ag_finobt_newino(
                         return error;
                 if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
                         return -EFSCORRUPTED;
-                return 0;
+                if (!xfs_dialloc_ag_inobt_rec_busy(pag, rec))
+                        return 0;
         }
 }
 
@@ -1379,6 +1442,14 @@ xfs_dialloc_ag_finobt_newino(
         if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
                 return -EFSCORRUPTED;
 
+        while (xfs_dialloc_ag_inobt_rec_busy(pag, rec)) {
+                error = xfs_ialloc_next_rec(cur, rec, &i, 0);
+                if (error)
+                        return error;
+                if (i == 1)
+                        return -EAGAIN;
+        }
+
         return 0;
 }
 
@@ -1470,9 +1541,9 @@ xfs_dialloc_ag(
          * not, consider the agi hint or find the first free inode in the AG.
          */
         if (pag->pag_agno == pagno)
-                error = xfs_dialloc_ag_finobt_near(pagino, &cur, &rec);
+                error = xfs_dialloc_ag_finobt_near(pag, pagino, &cur, &rec);
         else
-                error = xfs_dialloc_ag_finobt_newino(agi, cur, &rec);
+                error = xfs_dialloc_ag_finobt_newino(pag, agi, cur, &rec);
         if (error)
                 goto error_cur;
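To summarize the record selection policy in one place, the sketch below walks candidate records, skips any record that still maps to a busy in-core inode, and signals -EAGAIN so the caller can fall back to chunk allocation and retry. The helpers are hypothetical stand-ins for xfs_dialloc_ag_inobt_rec_busy() and xfs_ialloc_next_rec(), not real XFS functions:

#include <linux/errno.h>
#include <linux/types.h>

struct candidate_rec;   /* stand-in for struct xfs_inobt_rec_incore */

static bool rec_has_busy_inodes(struct candidate_rec *rec);             /* hypothetical */
static int next_candidate_rec(struct candidate_rec *rec, int *valid);   /* hypothetical */

static int pick_non_busy_record(struct candidate_rec *rec)
{
        int valid = 1;
        int error;

        while (valid) {
                /* A record is usable only if no inode in it is still busy. */
                if (!rec_has_busy_inodes(rec))
                        return 0;

                /* Otherwise advance to the next candidate, if any remain. */
                error = next_candidate_rec(rec, &valid);
                if (error)
                        return error;
        }

        /* Every candidate was busy: caller allocates a new chunk and retries. */
        return -EAGAIN;
}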