From patchwork Fri Mar 5 05:10:59 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117689 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A071FC433DB for ; Fri, 5 Mar 2021 05:12:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 73DC765013 for ; Fri, 5 Mar 2021 05:12:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229516AbhCEFL7 (ORCPT ); Fri, 5 Mar 2021 00:11:59 -0500 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:55457 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229517AbhCEFLx (ORCPT ); Fri, 5 Mar 2021 00:11:53 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id DF2EB63068 for ; Fri, 5 Mar 2021 16:11:50 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbnj-1M for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kf-000lYm-PS for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:49 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 01/45] xfs: initialise attr fork on inode create Date: Fri, 5 Mar 2021 16:10:59 +1100 Message-Id: <20210305051143.182133-2-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=k0WJhPe4DV5wXDVh-LAA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner When we allocate a new inode, we often need to add an attribute to the inode as part of the create. This can happen as a result of needing to add default ACLs or security labels before the inode is made visible to userspace. This is highly inefficient right now. We do the create transaction to allocate the inode, then we do an "add attr fork" transaction to modify the just created empty inode to set the inode fork offset to allow attributes to be stored, then we go and do the attribute creation. This means 3 transactions instead of 1 to allocate an inode, and this greatly increases the load on the CIL commit code, resulting in excessive contention on the CIL spin locks and performance degradation: 18.99% [kernel] [k] __pv_queued_spin_lock_slowpath 3.57% [kernel] [k] do_raw_spin_lock 2.51% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 2.48% [kernel] [k] memcpy 2.34% [kernel] [k] xfs_log_commit_cil The typical profile resulting from running fsmark on a selinux enabled filesytem is adds this overhead to the create path: - 15.30% xfs_init_security - 15.23% security_inode_init_security - 13.05% xfs_initxattrs - 12.94% xfs_attr_set - 6.75% xfs_bmap_add_attrfork - 5.51% xfs_trans_commit - 5.48% __xfs_trans_commit - 5.35% xfs_log_commit_cil - 3.86% _raw_spin_lock - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.70% xfs_trans_alloc 0.52% xfs_trans_reserve - 5.41% xfs_attr_set_args - 5.39% xfs_attr_set_shortform.constprop.0 - 4.46% xfs_trans_commit - 4.46% __xfs_trans_commit - 4.33% xfs_log_commit_cil - 2.74% _raw_spin_lock - do_raw_spin_lock __pv_queued_spin_lock_slowpath 0.60% xfs_inode_item_format 0.90% xfs_attr_try_sf_addname - 1.99% selinux_inode_init_security - 1.02% security_sid_to_context_force - 1.00% security_sid_to_context_core - 0.92% sidtab_entry_to_string - 0.90% sidtab_sid2str_get 0.59% sidtab_sid2str_put.part.0 - 0.82% selinux_determine_inode_label - 0.77% security_transition_sid 0.70% security_compute_sid.part.0 And fsmark creation rate performance drops by ~25%. The key point to note here is that half the additional overhead comes from adding the attribute fork to the newly created inode. That's crazy, considering we can do this same thing at inode create time with a couple of lines of code and no extra overhead. So, if we know we are going to add an attribute immediately after creating the inode, let's just initialise the attribute fork inside the create transaction and chop that whole chunk of code out of the create fast path. This completely removes the performance drop caused by enabling SELinux, and the profile looks like: - 8.99% xfs_init_security - 9.00% security_inode_init_security - 6.43% xfs_initxattrs - 6.37% xfs_attr_set - 5.45% xfs_attr_set_args - 5.42% xfs_attr_set_shortform.constprop.0 - 4.51% xfs_trans_commit - 4.54% __xfs_trans_commit - 4.59% xfs_log_commit_cil - 2.67% _raw_spin_lock - 3.28% do_raw_spin_lock 3.08% __pv_queued_spin_lock_slowpath 0.66% xfs_inode_item_format - 0.90% xfs_attr_try_sf_addname - 0.60% xfs_trans_alloc - 2.35% selinux_inode_init_security - 1.25% security_sid_to_context_force - 1.21% security_sid_to_context_core - 1.19% sidtab_entry_to_string - 1.20% sidtab_sid2str_get - 0.86% sidtab_sid2str_put.part.0 - 0.62% _raw_spin_lock_irqsave - 0.77% do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.84% selinux_determine_inode_label - 0.83% security_transition_sid 0.86% security_compute_sid.part.0 Which indicates the XFS overhead of creating the selinux xattr has been halved. This doesn't fix the CIL lock contention problem, just means it's not a limiting factor for this workload. Lock contention in the security subsystems is going to be an issue soon, though... Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- fs/xfs/libxfs/xfs_bmap.c | 9 ++++----- fs/xfs/libxfs/xfs_inode_fork.c | 20 +++++++++++++++----- fs/xfs/libxfs/xfs_inode_fork.h | 2 ++ fs/xfs/xfs_inode.c | 24 +++++++++++++++++++++--- fs/xfs/xfs_inode.h | 6 ++++-- fs/xfs/xfs_iops.c | 34 +++++++++++++++++++++++++++++++++- fs/xfs/xfs_qm.c | 2 +- fs/xfs/xfs_symlink.c | 2 +- 8 files changed, 81 insertions(+), 18 deletions(-) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index e0905ad171f0..5574d345d066 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -1027,7 +1027,9 @@ xfs_bmap_add_attrfork_local( return -EFSCORRUPTED; } -/* Set an inode attr fork off based on the format */ +/* + * Set an inode attr fork offset based on the format of the data fork. + */ int xfs_bmap_set_attrforkoff( struct xfs_inode *ip, @@ -1092,10 +1094,7 @@ xfs_bmap_add_attrfork( goto trans_cancel; ASSERT(ip->i_afp == NULL); - ip->i_afp = kmem_cache_zalloc(xfs_ifork_zone, - GFP_KERNEL | __GFP_NOFAIL); - - ip->i_afp->if_format = XFS_DINODE_FMT_EXTENTS; + ip->i_afp = xfs_ifork_alloc(XFS_DINODE_FMT_EXTENTS, 0); ip->i_afp->if_flags = XFS_IFEXTENTS; logflags = 0; switch (ip->i_df.if_format) { diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c index e080d7e07643..c606c1a77e5a 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.c +++ b/fs/xfs/libxfs/xfs_inode_fork.c @@ -282,6 +282,19 @@ xfs_dfork_attr_shortform_size( return be16_to_cpu(atp->hdr.totsize); } +struct xfs_ifork * +xfs_ifork_alloc( + enum xfs_dinode_fmt format, + xfs_extnum_t nextents) +{ + struct xfs_ifork *ifp; + + ifp = kmem_cache_zalloc(xfs_ifork_zone, GFP_NOFS | __GFP_NOFAIL); + ifp->if_format = format; + ifp->if_nextents = nextents; + return ifp; +} + int xfs_iformat_attr_fork( struct xfs_inode *ip, @@ -293,11 +306,8 @@ xfs_iformat_attr_fork( * Initialize the extent count early, as the per-format routines may * depend on it. */ - ip->i_afp = kmem_cache_zalloc(xfs_ifork_zone, GFP_NOFS | __GFP_NOFAIL); - ip->i_afp->if_format = dip->di_aformat; - if (unlikely(ip->i_afp->if_format == 0)) /* pre IRIX 6.2 file system */ - ip->i_afp->if_format = XFS_DINODE_FMT_EXTENTS; - ip->i_afp->if_nextents = be16_to_cpu(dip->di_anextents); + ip->i_afp = xfs_ifork_alloc(dip->di_aformat, + be16_to_cpu(dip->di_anextents)); switch (ip->i_afp->if_format) { case XFS_DINODE_FMT_LOCAL: diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index 9e2137cd7372..a0717ab0e5c5 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -141,6 +141,8 @@ static inline int8_t xfs_ifork_format(struct xfs_ifork *ifp) return ifp->if_format; } +struct xfs_ifork *xfs_ifork_alloc(enum xfs_dinode_fmt format, + xfs_extnum_t nextents); struct xfs_ifork *xfs_iext_state_to_fork(struct xfs_inode *ip, int state); int xfs_iformat_data_fork(struct xfs_inode *, struct xfs_dinode *); diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 46a861d55e48..bed2beb169e4 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -774,6 +774,7 @@ xfs_init_new_inode( xfs_nlink_t nlink, dev_t rdev, prid_t prid, + bool init_xattrs, struct xfs_inode **ipp) { struct inode *dir = pip ? VFS_I(pip) : NULL; @@ -877,6 +878,20 @@ xfs_init_new_inode( ASSERT(0); } + /* + * If we need to create attributes immediately after allocating the + * inode, initialise an empty attribute fork right now. We use the + * default fork offset for attributes here as we don't know exactly what + * size or how many attributes we might be adding. We can do this + * safely here because we know the data fork is completely empty and + * this saves us from needing to run a separate transaction to set the + * fork offset in the immediate future. + */ + if (init_xattrs) { + ip->i_d.di_forkoff = xfs_default_attroffset(ip) >> 3; + ip->i_afp = xfs_ifork_alloc(XFS_DINODE_FMT_EXTENTS, 0); + } + /* * Log the new values stuffed into the inode. */ @@ -910,6 +925,7 @@ xfs_dir_ialloc( xfs_nlink_t nlink, dev_t rdev, prid_t prid, + bool init_xattrs, struct xfs_inode **ipp) { struct xfs_buf *agibp; @@ -937,7 +953,7 @@ xfs_dir_ialloc( ASSERT(ino != NULLFSINO); return xfs_init_new_inode(mnt_userns, *tpp, dp, ino, mode, nlink, rdev, - prid, ipp); + prid, init_xattrs, ipp); } /* @@ -982,6 +998,7 @@ xfs_create( struct xfs_name *name, umode_t mode, dev_t rdev, + bool init_xattrs, xfs_inode_t **ipp) { int is_dir = S_ISDIR(mode); @@ -1052,7 +1069,7 @@ xfs_create( * pointing to itself. */ error = xfs_dir_ialloc(mnt_userns, &tp, dp, mode, is_dir ? 2 : 1, rdev, - prid, &ip); + prid, init_xattrs, &ip); if (error) goto out_trans_cancel; @@ -1171,7 +1188,8 @@ xfs_create_tmpfile( if (error) goto out_release_dquots; - error = xfs_dir_ialloc(mnt_userns, &tp, dp, mode, 0, 0, prid, &ip); + error = xfs_dir_ialloc(mnt_userns, &tp, dp, mode, 0, 0, prid, + false, &ip); if (error) goto out_trans_cancel; diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 88ee4c3930ae..a2cacdb76d55 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -371,7 +371,8 @@ int xfs_lookup(struct xfs_inode *dp, struct xfs_name *name, struct xfs_inode **ipp, struct xfs_name *ci_name); int xfs_create(struct user_namespace *mnt_userns, struct xfs_inode *dp, struct xfs_name *name, - umode_t mode, dev_t rdev, struct xfs_inode **ipp); + umode_t mode, dev_t rdev, bool need_xattr, + struct xfs_inode **ipp); int xfs_create_tmpfile(struct user_namespace *mnt_userns, struct xfs_inode *dp, umode_t mode, struct xfs_inode **ipp); @@ -413,7 +414,8 @@ xfs_extlen_t xfs_get_cowextsz_hint(struct xfs_inode *ip); int xfs_dir_ialloc(struct user_namespace *mnt_userns, struct xfs_trans **tpp, struct xfs_inode *dp, umode_t mode, xfs_nlink_t nlink, dev_t dev, - prid_t prid, struct xfs_inode **ipp); + prid_t prid, bool need_xattr, + struct xfs_inode **ipp); static inline int xfs_itruncate_extents( diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 66ebccb5a6ff..a9d466b78646 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -126,6 +126,37 @@ xfs_cleanup_inode( xfs_remove(XFS_I(dir), &teardown, XFS_I(inode)); } +/* + * Check to see if we are likely to need an extended attribute to be added to + * the inode we are about to allocate. This allows the attribute fork to be + * created during the inode allocation, reducing the number of transactions we + * need to do in this fast path. + * + * The security checks are optimistic, but not guaranteed. The two LSMs that + * require xattrs to be added here (selinux and smack) are also the only two + * LSMs that add a sb->s_security structure to the superblock. Hence if security + * is enabled and sb->s_security is set, we have a pretty good idea that we are + * going to be asked to add a security xattr immediately after allocating the + * xfs inode and instantiating the VFS inode. + */ +static inline bool +xfs_create_need_xattr( + struct inode *dir, + struct posix_acl *default_acl, + struct posix_acl *acl) +{ + if (acl) + return true; + if (default_acl) + return true; + if (!IS_ENABLED(CONFIG_SECURITY)) + return false; + if (dir->i_sb->s_security) + return true; + return false; +} + + STATIC int xfs_generic_create( struct user_namespace *mnt_userns, @@ -163,7 +194,8 @@ xfs_generic_create( if (!tmpfile) { error = xfs_create(mnt_userns, XFS_I(dir), &name, mode, rdev, - &ip); + xfs_create_need_xattr(dir, default_acl, acl), + &ip); } else { error = xfs_create_tmpfile(mnt_userns, XFS_I(dir), mode, &ip); } diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c index bfa4164990b1..6fde318b9fed 100644 --- a/fs/xfs/xfs_qm.c +++ b/fs/xfs/xfs_qm.c @@ -788,7 +788,7 @@ xfs_qm_qino_alloc( if (need_alloc) { error = xfs_dir_ialloc(&init_user_ns, &tp, NULL, S_IFREG, 1, 0, - 0, ipp); + 0, false, ipp); if (error) { xfs_trans_cancel(tp); return error; diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c index 1379013d74b8..162cf69bd982 100644 --- a/fs/xfs/xfs_symlink.c +++ b/fs/xfs/xfs_symlink.c @@ -223,7 +223,7 @@ xfs_symlink( * Allocate an inode for the symlink. */ error = xfs_dir_ialloc(mnt_userns, &tp, dp, S_IFLNK | (mode & ~S_IFMT), - 1, 0, prid, &ip); + 1, 0, prid, false, &ip); if (error) goto out_trans_cancel; From patchwork Fri Mar 5 05:11:00 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117741 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CC0F3C4332D for ; Fri, 5 Mar 2021 05:12:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id ACEA465016 for ; Fri, 5 Mar 2021 05:12:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229516AbhCEFMy (ORCPT ); Fri, 5 Mar 2021 00:12:54 -0500 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:40568 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229631AbhCEFMN (ORCPT ); Fri, 5 Mar 2021 00:12:13 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id DB65D78AC85 for ; Fri, 5 Mar 2021 16:11:55 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbnm-2i for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kf-000lYp-Qt for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:49 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 02/45] xfs: log stripe roundoff is a property of the log Date: Fri, 5 Mar 2021 16:11:00 +1100 Message-Id: <20210305051143.182133-3-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=pGLkceISAAAA:8 a=VwQbUJbxAAAA:8 a=D3aekpjFca_zLI-q5woA:9 a=AjGcO6oz07-iQ99wixmX:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner We don't need to look at the xfs_mount and superblock every time we need to do an iclog roundoff calculation. The property is fixed for the life of the log, so store the roundoff in the log at mount time and use that everywhere. On a debug build: $ size fs/xfs/xfs_log.o.* text data bss dec hex filename 27360 560 8 27928 6d18 fs/xfs/xfs_log.o.orig 27219 560 8 27787 6c8b fs/xfs/xfs_log.o.patched Signed-off-by: Dave Chinner Reviewed-by: Chandan Babu R Reviewed-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- fs/xfs/libxfs/xfs_log_format.h | 3 -- fs/xfs/xfs_log.c | 59 ++++++++++++++-------------------- fs/xfs/xfs_log_priv.h | 2 ++ 3 files changed, 27 insertions(+), 37 deletions(-) diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h index 8bd00da6d2a4..16587219549c 100644 --- a/fs/xfs/libxfs/xfs_log_format.h +++ b/fs/xfs/libxfs/xfs_log_format.h @@ -34,9 +34,6 @@ typedef uint32_t xlog_tid_t; #define XLOG_MIN_RECORD_BSHIFT 14 /* 16384 == 1 << 14 */ #define XLOG_BIG_RECORD_BSHIFT 15 /* 32k == 1 << 15 */ #define XLOG_MAX_RECORD_BSHIFT 18 /* 256k == 1 << 18 */ -#define XLOG_BTOLSUNIT(log, b) (((b)+(log)->l_mp->m_sb.sb_logsunit-1) / \ - (log)->l_mp->m_sb.sb_logsunit) -#define XLOG_LSUNITTOB(log, su) ((su) * (log)->l_mp->m_sb.sb_logsunit) #define XLOG_HEADER_SIZE 512 diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 06041834daa3..fa284f26d10e 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1399,6 +1399,11 @@ xlog_alloc_log( xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0); log->l_curr_cycle = 1; /* 0 is bad since this is initial value */ + if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1) + log->l_iclog_roundoff = mp->m_sb.sb_logsunit; + else + log->l_iclog_roundoff = BBSIZE; + xlog_grant_head_init(&log->l_reserve_head); xlog_grant_head_init(&log->l_write_head); @@ -1852,29 +1857,15 @@ xlog_calc_iclog_size( uint32_t *roundoff) { uint32_t count_init, count; - bool use_lsunit; - - use_lsunit = xfs_sb_version_haslogv2(&log->l_mp->m_sb) && - log->l_mp->m_sb.sb_logsunit > 1; /* Add for LR header */ count_init = log->l_iclog_hsize + iclog->ic_offset; + count = roundup(count_init, log->l_iclog_roundoff); - /* Round out the log write size */ - if (use_lsunit) { - /* we have a v2 stripe unit to use */ - count = XLOG_LSUNITTOB(log, XLOG_BTOLSUNIT(log, count_init)); - } else { - count = BBTOB(BTOBB(count_init)); - } - - ASSERT(count >= count_init); *roundoff = count - count_init; - if (use_lsunit) - ASSERT(*roundoff < log->l_mp->m_sb.sb_logsunit); - else - ASSERT(*roundoff < BBTOB(1)); + ASSERT(count >= count_init); + ASSERT(*roundoff < log->l_iclog_roundoff); return count; } @@ -3149,10 +3140,9 @@ xlog_state_switch_iclogs( log->l_curr_block += BTOBB(eventual_size)+BTOBB(log->l_iclog_hsize); /* Round up to next log-sunit */ - if (xfs_sb_version_haslogv2(&log->l_mp->m_sb) && - log->l_mp->m_sb.sb_logsunit > 1) { - uint32_t sunit_bb = BTOBB(log->l_mp->m_sb.sb_logsunit); - log->l_curr_block = roundup(log->l_curr_block, sunit_bb); + if (log->l_iclog_roundoff > BBSIZE) { + log->l_curr_block = roundup(log->l_curr_block, + BTOBB(log->l_iclog_roundoff)); } if (log->l_curr_block >= log->l_logBBsize) { @@ -3404,12 +3394,11 @@ xfs_log_ticket_get( * Figure out the total log space unit (in bytes) that would be * required for a log ticket. */ -int -xfs_log_calc_unit_res( - struct xfs_mount *mp, +static int +xlog_calc_unit_res( + struct xlog *log, int unit_bytes) { - struct xlog *log = mp->m_log; int iclog_space; uint num_headers; @@ -3485,18 +3474,20 @@ xfs_log_calc_unit_res( /* for commit-rec LR header - note: padding will subsume the ophdr */ unit_bytes += log->l_iclog_hsize; - /* for roundoff padding for transaction data and one for commit record */ - if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1) { - /* log su roundoff */ - unit_bytes += 2 * mp->m_sb.sb_logsunit; - } else { - /* BB roundoff */ - unit_bytes += 2 * BBSIZE; - } + /* roundoff padding for transaction data and one for commit record */ + unit_bytes += 2 * log->l_iclog_roundoff; return unit_bytes; } +int +xfs_log_calc_unit_res( + struct xfs_mount *mp, + int unit_bytes) +{ + return xlog_calc_unit_res(mp->m_log, unit_bytes); +} + /* * Allocate and initialise a new log ticket. */ @@ -3513,7 +3504,7 @@ xlog_ticket_alloc( tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL); - unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes); + unit_res = xlog_calc_unit_res(log, unit_bytes); atomic_set(&tic->t_ref, 1); tic->t_task = current; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 1c6fdbf3d506..037950cf1061 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -436,6 +436,8 @@ struct xlog { #endif /* log recovery lsn tracking (for buffer submission */ xfs_lsn_t l_recovery_lsn; + + uint32_t l_iclog_roundoff;/* padding roundoff */ }; #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \ From patchwork Fri Mar 5 05:11:01 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117699 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 07EC6C4332B for ; Fri, 5 Mar 2021 05:12:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DB14465016 for ; Fri, 5 Mar 2021 05:12:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229528AbhCEFMR (ORCPT ); Fri, 5 Mar 2021 00:12:17 -0500 Received: from mail110.syd.optusnet.com.au ([211.29.132.97]:51700 "EHLO mail110.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229551AbhCEFL6 (ORCPT ); Fri, 5 Mar 2021 00:11:58 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id D93B2107992 for ; Fri, 5 Mar 2021 16:11:50 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbno-3e for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kf-000lYs-SI for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:49 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 03/45] xfs: separate CIL commit record IO Date: Fri, 5 Mar 2021 16:11:01 +1100 Message-Id: <20210305051143.182133-4-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=Tu+Yewfh c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=VwQbUJbxAAAA:8 a=6GbAA9ItjH-Z6eHFJ-YA:9 a=AjGcO6oz07-iQ99wixmX:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner To allow for iclog IO device cache flush behaviour to be optimised, we first need to separate out the commit record iclog IO from the rest of the checkpoint so we can wait for the checkpoint IO to complete before we issue the commit record. This separation is only necessary if the commit record is being written into a different iclog to the start of the checkpoint as the upcoming cache flushing changes requires completion ordering against the other iclogs submitted by the checkpoint. If the entire checkpoint and commit is in the one iclog, then they are both covered by the one set of cache flush primitives on the iclog and hence there is no need to separate them for ordering. Otherwise, we need to wait for all the previous iclogs to complete so they are ordered correctly and made stable by the REQ_PREFLUSH that the commit record iclog IO issues. This guarantees that if a reader sees the commit record in the journal, they will also see the entire checkpoint that commit record closes off. This also provides the guarantee that when the commit record IO completes, we can safely unpin all the log items in the checkpoint so they can be written back because the entire checkpoint is stable in the journal. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Chandan Babu R Reviewed-by: Brian Foster --- fs/xfs/xfs_log.c | 8 +++++--- fs/xfs/xfs_log_cil.c | 9 +++++++++ fs/xfs/xfs_log_priv.h | 2 ++ 3 files changed, 16 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index fa284f26d10e..317c466232d4 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -784,10 +784,12 @@ xfs_log_mount_cancel( } /* - * Wait for the iclog to be written disk, or return an error if the log has been - * shut down. + * Wait for the iclog and all prior iclogs to be written disk as required by the + * log force state machine. Waiting on ic_force_wait ensures iclog completions + * have been ordered and callbacks run before we are woken here, hence + * guaranteeing that all the iclogs up to this one are on stable storage. */ -static int +int xlog_wait_on_iclog( struct xlog_in_core *iclog) __releases(iclog->ic_log->l_icloglock) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index b0ef071b3cb5..1e5fd6f268c2 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -870,6 +870,15 @@ xlog_cil_push_work( wake_up_all(&cil->xc_commit_wait); spin_unlock(&cil->xc_push_lock); + /* + * If the checkpoint spans multiple iclogs, wait for all previous + * iclogs to complete before we submit the commit_iclog. + */ + if (ctx->start_lsn != commit_lsn) { + spin_lock(&log->l_icloglock); + xlog_wait_on_iclog(commit_iclog->ic_prev); + } + /* release the hounds! */ xfs_log_release_iclog(commit_iclog); return; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 037950cf1061..ee7786b33da9 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -584,6 +584,8 @@ xlog_wait( remove_wait_queue(wq, &wait); } +int xlog_wait_on_iclog(struct xlog_in_core *iclog); + /* * The LSN is valid so long as it is behind the current LSN. If it isn't, this * means that the next log record that includes this metadata could have a From patchwork Fri Mar 5 05:11:02 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117733 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 77340C433E9 for ; Fri, 5 Mar 2021 05:12:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5345365013 for ; Fri, 5 Mar 2021 05:12:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229497AbhCEFMx (ORCPT ); Fri, 5 Mar 2021 00:12:53 -0500 Received: from mail108.syd.optusnet.com.au ([211.29.132.59]:48691 "EHLO mail108.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229512AbhCEFMM (ORCPT ); Fri, 5 Mar 2021 00:12:12 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail108.syd.optusnet.com.au (Postfix) with ESMTPS id B2C791ADD91 for ; Fri, 5 Mar 2021 16:11:55 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbnq-4i for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kf-000lYv-TK for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:49 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush Date: Fri, 5 Mar 2021 16:11:02 +1100 Message-Id: <20210305051143.182133-5-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=Lz9GblgEbu-RQyWGrh4A:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner It's a one line wrapper around blkdev_issue_flush(). Just replace it with direct calls to blkdev_issue_flush(). Signed-off-by: Dave Chinner Reviewed-by: Chandan Babu R Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_buf.c | 2 +- fs/xfs/xfs_file.c | 6 +++--- fs/xfs/xfs_log.c | 2 +- fs/xfs/xfs_super.c | 7 ------- fs/xfs/xfs_super.h | 1 - 5 files changed, 5 insertions(+), 13 deletions(-) diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 37a1d12762d8..7043546a04b8 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -1958,7 +1958,7 @@ xfs_free_buftarg( percpu_counter_destroy(&btp->bt_io_count); list_lru_destroy(&btp->bt_lru); - xfs_blkdev_issue_flush(btp); + blkdev_issue_flush(btp->bt_bdev); kmem_free(btp); } diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index a007ca0711d9..24c7f45fc4eb 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -197,9 +197,9 @@ xfs_file_fsync( * inode size in case of an extending write. */ if (XFS_IS_REALTIME_INODE(ip)) - xfs_blkdev_issue_flush(mp->m_rtdev_targp); + blkdev_issue_flush(mp->m_rtdev_targp->bt_bdev); else if (mp->m_logdev_targp != mp->m_ddev_targp) - xfs_blkdev_issue_flush(mp->m_ddev_targp); + blkdev_issue_flush(mp->m_ddev_targp->bt_bdev); /* * Any inode that has dirty modifications in the log is pinned. The @@ -219,7 +219,7 @@ xfs_file_fsync( */ if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) && mp->m_logdev_targp == mp->m_ddev_targp) - xfs_blkdev_issue_flush(mp->m_ddev_targp); + blkdev_issue_flush(mp->m_ddev_targp->bt_bdev); return error; } diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 317c466232d4..fee76c485727 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1962,7 +1962,7 @@ xlog_sync( * layer state machine for preflushes. */ if (log->l_targ != log->l_mp->m_ddev_targp || split) { - xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp); + blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev); need_flush = false; } diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index e5e0713bebcd..ca2cb0448b5e 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -339,13 +339,6 @@ xfs_blkdev_put( blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL); } -void -xfs_blkdev_issue_flush( - xfs_buftarg_t *buftarg) -{ - blkdev_issue_flush(buftarg->bt_bdev); -} - STATIC void xfs_close_devices( struct xfs_mount *mp) diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h index 1ca484b8357f..79cb2dece811 100644 --- a/fs/xfs/xfs_super.h +++ b/fs/xfs/xfs_super.h @@ -88,7 +88,6 @@ struct block_device; extern void xfs_quiesce_attr(struct xfs_mount *mp); extern void xfs_flush_inodes(struct xfs_mount *mp); -extern void xfs_blkdev_issue_flush(struct xfs_buftarg *); extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *, xfs_agnumber_t agcount); From patchwork Fri Mar 5 05:11:03 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117677 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 03EEEC433E0 for ; Fri, 5 Mar 2021 05:11:58 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CFDA165013 for ; Fri, 5 Mar 2021 05:11:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229523AbhCEFLw (ORCPT ); Fri, 5 Mar 2021 00:11:52 -0500 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:36722 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229446AbhCEFLw (ORCPT ); Fri, 5 Mar 2021 00:11:52 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 18D97ECB205 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbns-5T for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kf-000lYy-UD for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:49 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 05/45] xfs: async blkdev cache flush Date: Fri, 5 Mar 2021 16:11:03 +1100 Message-Id: <20210305051143.182133-6-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=SfSGQrjV1alRuBDLQEQA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner The new checkpoint caceh flush mechanism requires us to issue an unconditional cache flush before we start a new checkpoint. We don't want to block for this if we can help it, and we have a fair chunk of CPU work to do between starting the checkpoint and issuing the first journal IO. Hence it makes sense to amortise the latency cost of the cache flush by issuing it asynchronously and then waiting for it only when we need to issue the first IO in the transaction. TO do this, we need async cache flush primitives to submit the cache flush bio and to wait on it. THe block layer has no such primitives for filesystems, so roll our own for the moment. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster --- fs/xfs/xfs_bio_io.c | 36 ++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_linux.h | 2 ++ 2 files changed, 38 insertions(+) diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c index 17f36db2f792..668f8bd27b4a 100644 --- a/fs/xfs/xfs_bio_io.c +++ b/fs/xfs/xfs_bio_io.c @@ -9,6 +9,42 @@ static inline unsigned int bio_max_vecs(unsigned int count) return bio_max_segs(howmany(count, PAGE_SIZE)); } +void +xfs_flush_bdev_async_endio( + struct bio *bio) +{ + if (bio->bi_private) + complete(bio->bi_private); +} + +/* + * Submit a request for an async cache flush to run. If the request queue does + * not require flush operations, just skip it altogether. If the caller needsi + * to wait for the flush completion at a later point in time, they must supply a + * valid completion. This will be signalled when the flush completes. The + * caller never sees the bio that is issued here. + */ +void +xfs_flush_bdev_async( + struct bio *bio, + struct block_device *bdev, + struct completion *done) +{ + struct request_queue *q = bdev->bd_disk->queue; + + if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) { + complete(done); + return; + } + + bio_init(bio, NULL, 0); + bio_set_dev(bio, bdev); + bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC; + bio->bi_private = done; + bio->bi_end_io = xfs_flush_bdev_async_endio; + + submit_bio(bio); +} int xfs_rw_bdev( struct block_device *bdev, diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h index af6be9b9ccdf..953d98bc4832 100644 --- a/fs/xfs/xfs_linux.h +++ b/fs/xfs/xfs_linux.h @@ -196,6 +196,8 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y) int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count, char *data, unsigned int op); +void xfs_flush_bdev_async(struct bio *bio, struct block_device *bdev, + struct completion *done); #define ASSERT_ALWAYS(expr) \ (likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__)) From patchwork Fri Mar 5 05:11:04 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117701 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AB6ACC433DB for ; Fri, 5 Mar 2021 05:12:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8766A64FD4 for ; Fri, 5 Mar 2021 05:12:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229559AbhCEFMR (ORCPT ); Fri, 5 Mar 2021 00:12:17 -0500 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:57739 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229578AbhCEFMA (ORCPT ); Fri, 5 Mar 2021 00:12:00 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 1871B828018 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbnu-6W for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kf-000lZ1-V7 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:49 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 06/45] xfs: CIL checkpoint flushes caches unconditionally Date: Fri, 5 Mar 2021 16:11:04 +1100 Message-Id: <20210305051143.182133-7-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=pGLkceISAAAA:8 a=VwQbUJbxAAAA:8 a=gwj23wwd_E3P4KkB8XMA:9 a=AjGcO6oz07-iQ99wixmX:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to guarantee the ordering requirements the journal has w.r.t. metadata writeback. THe two ordering constraints are: 1. we cannot overwrite metadata in the journal until we guarantee that the dirty metadata has been written back in place and is stable. 2. we cannot write back dirty metadata until it has been written to the journal and guaranteed to be stable (and hence recoverable) in the journal. These rules apply to the atomic transactions recorded in the journal, not to the journal IO itself. Hence we need to ensure metadata is stable before we start writing a new transaction to the journal (guarantee #1), and we need to ensure the entire transaction is stable in the journal before we start metadata writeback (guarantee #2). The ordering guarantees of #1 are currently provided by REQ_PREFLUSH being added to every iclog IO. This causes the journal IO to issue a cache flush and wait for it to complete before issuing the write IO to the journal. Hence all completed metadata IO is guaranteed to be stable before the journal overwrites the old metadata. However, for long running CIL checkpoints that might do a thousand journal IOs, we don't need every single one of these iclog IOs to issue a cache flush - the cache flush done before the first iclog is submitted is sufficient to cover the entire range in the log that the checkpoint will overwrite because the CIL space reservation guarantees the tail of the log (completed metadata) is already beyond the range of the checkpoint write. Hence we only need a full cache flush between closing off the CIL checkpoint context (i.e. when the push switches it out) and issuing the first journal IO. Rather than plumbing this through to the journal IO, we can start this cache flush the moment the CIL context is owned exclusively by the push worker. The cache flush can be in progress while we process the CIL ready for writing, hence reducing the latency of the initial iclog write. This is especially true for large checkpoints, where we might have to process hundreds of thousands of log vectors before we issue the first iclog write. In these cases, it is likely the cache flush has already been completed by the time we have built the CIL log vector chain. Signed-off-by: Dave Chinner Reviewed-by: Chandan Babu R Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_log_cil.c | 31 +++++++++++++++++++++++++++---- 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 1e5fd6f268c2..b4cdb8b6c4c3 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -656,6 +656,8 @@ xlog_cil_push_work( struct xfs_log_vec lvhdr = { NULL }; xfs_lsn_t commit_lsn; xfs_lsn_t push_seq; + struct bio bio; + DECLARE_COMPLETION_ONSTACK(bdev_flush); new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS); new_ctx->ticket = xlog_cil_ticket_alloc(log); @@ -719,10 +721,25 @@ xlog_cil_push_work( spin_unlock(&cil->xc_push_lock); /* - * pull all the log vectors off the items in the CIL, and - * remove the items from the CIL. We don't need the CIL lock - * here because it's only needed on the transaction commit - * side which is currently locked out by the flush lock. + * The CIL is stable at this point - nothing new will be added to it + * because we hold the flush lock exclusively. Hence we can now issue + * a cache flush to ensure all the completed metadata in the journal we + * are about to overwrite is on stable storage. + * + * This avoids the need to have the iclogs issue REQ_PREFLUSH based + * cache flushes to provide this ordering guarantee, and hence for CIL + * checkpoints that require hundreds or thousands of log writes no + * longer need to issue device cache flushes to provide metadata + * writeback ordering. + */ + xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev, + &bdev_flush); + + /* + * Pull all the log vectors off the items in the CIL, and remove the + * items from the CIL. We don't need the CIL lock here because it's only + * needed on the transaction commit side which is currently locked out + * by the flush lock. */ lv = NULL; num_iovecs = 0; @@ -806,6 +823,12 @@ xlog_cil_push_work( lvhdr.lv_iovecp = &lhdr; lvhdr.lv_next = ctx->lv_chain; + /* + * Before we format and submit the first iclog, we have to ensure that + * the metadata writeback ordering cache flush is complete. + */ + wait_for_completion(&bdev_flush); + error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true); if (error) goto out_abort_free_ticket; From patchwork Fri Mar 5 05:11:05 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117681 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,TVD_PH_BODY_ACCOUNTS_PRE, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id BF378C433E6 for ; Fri, 5 Mar 2021 05:11:58 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 942E665013 for ; Fri, 5 Mar 2021 05:11:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229497AbhCEFL5 (ORCPT ); Fri, 5 Mar 2021 00:11:57 -0500 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:36732 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229500AbhCEFLw (ORCPT ); Fri, 5 Mar 2021 00:11:52 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id DA277ECB1DE for ; Fri, 5 Mar 2021 16:11:50 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbo3-8C for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kf-000lZ4-WA for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 07/45] xfs: remove need_start_rec parameter from xlog_write() Date: Fri, 5 Mar 2021 16:11:05 +1100 Message-Id: <20210305051143.182133-8-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=Tu+Yewfh c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=pGLkceISAAAA:8 a=VwQbUJbxAAAA:8 a=U2sKdf1BhhSb6LhUA4oA:9 a=AjGcO6oz07-iQ99wixmX:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner The CIL push is the only call to xlog_write that sets this variable to true. The other callers don't need a start rec, and they tell xlog_write what to do by passing the type of ophdr they need written in the flags field. The need_start_rec parameter essentially tells xlog_write to to write an extra ophdr with a XLOG_START_TRANS type, so get rid of the variable to do this and pass XLOG_START_TRANS as the flag value into xlog_write() from the CIL push. $ size fs/xfs/xfs_log.o* text data bss dec hex filename 27595 560 8 28163 6e03 fs/xfs/xfs_log.o.orig 27454 560 8 28022 6d76 fs/xfs/xfs_log.o.patched Signed-off-by: Dave Chinner Reviewed-by: Chandan Babu R Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_log.c | 44 +++++++++++++++++++++---------------------- fs/xfs/xfs_log_cil.c | 3 ++- fs/xfs/xfs_log_priv.h | 3 +-- 3 files changed, 25 insertions(+), 25 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index fee76c485727..364694a83de6 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -818,9 +818,7 @@ xlog_wait_on_iclog( static int xlog_write_unmount_record( struct xlog *log, - struct xlog_ticket *ticket, - xfs_lsn_t *lsn, - uint flags) + struct xlog_ticket *ticket) { struct xfs_unmount_log_format ulf = { .magic = XLOG_UNMOUNT_TYPE, @@ -837,7 +835,7 @@ xlog_write_unmount_record( /* account for space used by record data */ ticket->t_curr_res -= sizeof(ulf); - return xlog_write(log, &vec, ticket, lsn, NULL, flags, false); + return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS); } /* @@ -851,15 +849,13 @@ xlog_unmount_write( struct xfs_mount *mp = log->l_mp; struct xlog_in_core *iclog; struct xlog_ticket *tic = NULL; - xfs_lsn_t lsn; - uint flags = XLOG_UNMOUNT_TRANS; int error; error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0); if (error) goto out_err; - error = xlog_write_unmount_record(log, tic, &lsn, flags); + error = xlog_write_unmount_record(log, tic); /* * At this point, we're umounting anyway, so there's no point in * transitioning log state to IOERROR. Just continue... @@ -1551,8 +1547,7 @@ xlog_commit_record( if (XLOG_FORCED_SHUTDOWN(log)) return -EIO; - error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS, - false); + error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS); if (error) xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); return error; @@ -2149,13 +2144,16 @@ static int xlog_write_calc_vec_length( struct xlog_ticket *ticket, struct xfs_log_vec *log_vector, - bool need_start_rec) + uint optype) { struct xfs_log_vec *lv; - int headers = need_start_rec ? 1 : 0; + int headers = 0; int len = 0; int i; + if (optype & XLOG_START_TRANS) + headers++; + for (lv = log_vector; lv; lv = lv->lv_next) { /* we don't write ordered log vectors */ if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) @@ -2375,8 +2373,7 @@ xlog_write( struct xlog_ticket *ticket, xfs_lsn_t *start_lsn, struct xlog_in_core **commit_iclog, - uint flags, - bool need_start_rec) + uint optype) { struct xlog_in_core *iclog = NULL; struct xfs_log_vec *lv = log_vector; @@ -2404,8 +2401,9 @@ xlog_write( xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); } - len = xlog_write_calc_vec_length(ticket, log_vector, need_start_rec); - *start_lsn = 0; + len = xlog_write_calc_vec_length(ticket, log_vector, optype); + if (start_lsn) + *start_lsn = 0; while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) { void *ptr; int log_offset; @@ -2419,7 +2417,7 @@ xlog_write( ptr = iclog->ic_datap + log_offset; /* start_lsn is the first lsn written to. That's all we need. */ - if (!*start_lsn) + if (start_lsn && !*start_lsn) *start_lsn = be64_to_cpu(iclog->ic_header.h_lsn); /* @@ -2432,6 +2430,7 @@ xlog_write( int copy_len; int copy_off; bool ordered = false; + bool wrote_start_rec = false; /* ordered log vectors have no regions to write */ if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) { @@ -2449,13 +2448,15 @@ xlog_write( * write a start record. Only do this for the first * iclog we write to. */ - if (need_start_rec) { + if (optype & XLOG_START_TRANS) { xlog_write_start_rec(ptr, ticket); xlog_write_adv_cnt(&ptr, &len, &log_offset, sizeof(struct xlog_op_header)); + optype &= ~XLOG_START_TRANS; + wrote_start_rec = true; } - ophdr = xlog_write_setup_ophdr(log, ptr, ticket, flags); + ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype); if (!ophdr) return -EIO; @@ -2486,14 +2487,13 @@ xlog_write( } copy_len += sizeof(struct xlog_op_header); record_cnt++; - if (need_start_rec) { + if (wrote_start_rec) { copy_len += sizeof(struct xlog_op_header); record_cnt++; - need_start_rec = false; } data_cnt += contwr ? copy_len : 0; - error = xlog_write_copy_finish(log, iclog, flags, + error = xlog_write_copy_finish(log, iclog, optype, &record_cnt, &data_cnt, &partial_copy, &partial_copy_len, @@ -2537,7 +2537,7 @@ xlog_write( spin_lock(&log->l_icloglock); xlog_state_finish_copy(log, iclog, record_cnt, data_cnt); if (commit_iclog) { - ASSERT(flags & XLOG_COMMIT_TRANS); + ASSERT(optype & XLOG_COMMIT_TRANS); *commit_iclog = iclog; } else { error = xlog_state_release_iclog(log, iclog); diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index b4cdb8b6c4c3..c04d5d37a3a2 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -829,7 +829,8 @@ xlog_cil_push_work( */ wait_for_completion(&bdev_flush); - error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true); + error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, + XLOG_START_TRANS); if (error) goto out_abort_free_ticket; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index ee7786b33da9..56e1942c47df 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -480,8 +480,7 @@ void xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket); void xlog_print_trans(struct xfs_trans *); int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector, struct xlog_ticket *tic, xfs_lsn_t *start_lsn, - struct xlog_in_core **commit_iclog, uint flags, - bool need_start_rec); + struct xlog_in_core **commit_iclog, uint optype); int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, struct xlog_in_core **iclog, xfs_lsn_t *lsn); void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket); From patchwork Fri Mar 5 05:11:06 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117697 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,TVD_PH_BODY_ACCOUNTS_PRE, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E4D92C433E6 for ; Fri, 5 Mar 2021 05:12:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C6A7465013 for ; Fri, 5 Mar 2021 05:12:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229538AbhCEFMR (ORCPT ); Fri, 5 Mar 2021 00:12:17 -0500 Received: from mail108.syd.optusnet.com.au ([211.29.132.59]:48033 "EHLO mail108.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229552AbhCEFL6 (ORCPT ); Fri, 5 Mar 2021 00:11:58 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail108.syd.optusnet.com.au (Postfix) with ESMTPS id 1905F1ADB59 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbo6-9U for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZ8-1Z for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 08/45] xfs: journal IO cache flush reductions Date: Fri, 5 Mar 2021 16:11:06 +1100 Message-Id: <20210305051143.182133-9-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=Q6LLlgmf-1NZ6pUK8LoA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to guarantee the ordering requirements the journal has w.r.t. metadata writeback. THe two ordering constraints are: 1. we cannot overwrite metadata in the journal until we guarantee that the dirty metadata has been written back in place and is stable. 2. we cannot write back dirty metadata until it has been written to the journal and guaranteed to be stable (and hence recoverable) in the journal. The ordering guarantees of #1 are provided by REQ_PREFLUSH. This causes the journal IO to issue a cache flush and wait for it to complete before issuing the write IO to the journal. Hence all completed metadata IO is guaranteed to be stable before the journal overwrites the old metadata. The ordering guarantees of #2 are provided by the REQ_FUA, which ensures the journal writes do not complete until they are on stable storage. Hence by the time the last journal IO in a checkpoint completes, we know that the entire checkpoint is on stable storage and we can unpin the dirty metadata and allow it to be written back. This is the mechanism by which ordering was first implemented in XFS way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96 ("Add support for drive write cache flushing") in the xfs-archive tree. A lot has changed since then, most notably we now use delayed logging to checkpoint the filesystem to the journal rather than write each individual transaction to the journal. Cache flushes on journal IO are necessary when individual transactions are wholly contained within a single iclog. However, CIL checkpoints are single transactions that typically span hundreds to thousands of individual journal writes, and so the requirements for device cache flushing have changed. That is, the ordering rules I state above apply to ordering of atomic transactions recorded in the journal, not to the journal IO itself. Hence we need to ensure metadata is stable before we start writing a new transaction to the journal (guarantee #1), and we need to ensure the entire transaction is stable in the journal before we start metadata writeback (guarantee #2). Hence we only need a REQ_PREFLUSH on the journal IO that starts a new journal transaction to provide #1, and it is not on any other journal IO done within the context of that journal transaction. The CIL checkpoint already issues a cache flush before it starts writing to the log, so we no longer need the iclog IO to issue a REQ_REFLUSH for us. Hence if XLOG_START_TRANS is passed to xlog_write(), we no longer need to mark the first iclog in the log write with REQ_PREFLUSH for this case. As an added bonus, this ordering mechanism works for both internal and external logs, meaning we can remove the explicit data device cache flushes from the iclog write code when using external logs. Given the new ordering semantics of commit records for the CIL, we need iclogs containing commit records to issue a REQ_PREFLUSH. We also require unmount records to do this. Hence for both XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need to mark the first iclog being written with REQ_PREFLUSH. For both commit records and unmount records, we also want them immediately on stable storage, so we want to also mark the iclogs that contain these records to be marked REQ_FUA. That means if a record is split across multiple iclogs, they are all marked REQ_FUA and not just the last one so that when the transaction is completed all the parts of the record are on stable storage. And for external logs, unmount records need a pre-write data device cache flush similar to the CIL checkpoint cache pre-flush as the internal iclog write code does not do this implicitly anymore. As an optimisation, when the commit record lands in the same iclog as the journal transaction starts, we don't need to wait for anything and can simply use REQ_FUA to provide guarantee #2. This means that for fsync() heavy workloads, the cache flush behaviour is completely unchanged and there is no degradation in performance as a result of optimise the multi-IO transaction case. The most notable sign that there is less IO latency on my test machine (nvme SSDs) is that the "noiclogs" rate has dropped substantially. This metric indicates that the CIL push is blocking in xlog_get_iclog_space() waiting for iclog IO completion to occur. With 8 iclogs of 256kB, the rate is appoximately 1 noiclog event to every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space() is blocking waiting for log IO. With the changes in this patch, this drops to 1 noiclog event for every 100 iclog writes. Hence it is clear that log IO is completing much faster than it was previously, but it is also clear that for large iclog sizes, this isn't the performance limiting factor on this hardware. With smaller iclogs (32kB), however, there is a sustantial difference. With the cache flush modifications, the journal is now running at over 4000 write IOPS, and the journal throughput is largely identical to the 256kB iclogs and the noiclog event rate stays low at about 1:50 iclog writes. The existing code tops out at about 2500 IOPS as the number of cache flushes dominate performance and latency. The noiclog event rate is about 1:4, and the performance variance is quite large as the journal throughput can fall to less than half the peak sustained rate when the cache flush rate prevents metadata writeback from keeping up and the log runs out of space and throttles reservations. As a result: logbsize fsmark create rate rm -rf before 32kb 152851+/-5.3e+04 5m28s patched 32kb 221533+/-1.1e+04 5m24s before 256kb 220239+/-6.2e+03 4m58s patched 256kb 228286+/-9.2e+03 5m06s The rm -rf times are included because I ran them, but the differences are largely noise. This workload is largely metadata read IO latency bound and the changes to the journal cache flushing doesn't really make any noticable difference to behaviour apart from a reduction in noiclog events from background CIL pushing. Signed-off-by: Dave Chinner Reviewed-by: Chandan Babu R --- fs/xfs/xfs_log.c | 53 +++++++++++++++++++++++-------------------- fs/xfs/xfs_log_cil.c | 7 +++++- fs/xfs/xfs_log_priv.h | 4 ++++ 3 files changed, 38 insertions(+), 26 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 364694a83de6..ed44d67d7099 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -835,6 +835,14 @@ xlog_write_unmount_record( /* account for space used by record data */ ticket->t_curr_res -= sizeof(ulf); + + /* + * For external log devices, we need to flush the data device cache + * first to ensure all metadata writeback is on stable storage before we + * stamp the tail LSN into the unmount record. + */ + if (log->l_targ != log->l_mp->m_ddev_targp) + blkdev_issue_flush(log->l_targ->bt_bdev); return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS); } @@ -1753,8 +1761,7 @@ xlog_write_iclog( struct xlog *log, struct xlog_in_core *iclog, uint64_t bno, - unsigned int count, - bool need_flush) + unsigned int count) { ASSERT(bno < log->l_logBBsize); @@ -1792,10 +1799,12 @@ xlog_write_iclog( * writeback throttle from throttling log writes behind background * metadata writeback and causing priority inversions. */ - iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | - REQ_IDLE | REQ_FUA; - if (need_flush) + iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE; + if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH) iclog->ic_bio.bi_opf |= REQ_PREFLUSH; + if (iclog->ic_flags & XLOG_ICL_NEED_FUA) + iclog->ic_bio.bi_opf |= REQ_FUA; + iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA); if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) { xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); @@ -1898,7 +1907,6 @@ xlog_sync( unsigned int roundoff; /* roundoff to BB or stripe */ uint64_t bno; unsigned int size; - bool need_flush = true, split = false; ASSERT(atomic_read(&iclog->ic_refcnt) == 0); @@ -1923,10 +1931,8 @@ xlog_sync( bno = BLOCK_LSN(be64_to_cpu(iclog->ic_header.h_lsn)); /* Do we need to split this write into 2 parts? */ - if (bno + BTOBB(count) > log->l_logBBsize) { + if (bno + BTOBB(count) > log->l_logBBsize) xlog_split_iclog(log, &iclog->ic_header, bno, count); - split = true; - } /* calculcate the checksum */ iclog->ic_header.h_crc = xlog_cksum(log, &iclog->ic_header, @@ -1947,22 +1953,8 @@ xlog_sync( be64_to_cpu(iclog->ic_header.h_lsn)); } #endif - - /* - * Flush the data device before flushing the log to make sure all meta - * data written back from the AIL actually made it to disk before - * stamping the new log tail LSN into the log buffer. For an external - * log we need to issue the flush explicitly, and unfortunately - * synchronously here; for an internal log we can simply use the block - * layer state machine for preflushes. - */ - if (log->l_targ != log->l_mp->m_ddev_targp || split) { - blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev); - need_flush = false; - } - xlog_verify_iclog(log, iclog, count); - xlog_write_iclog(log, iclog, bno, count, need_flush); + xlog_write_iclog(log, iclog, bno, count); } /* @@ -2416,10 +2408,21 @@ xlog_write( ASSERT(log_offset <= iclog->ic_size - 1); ptr = iclog->ic_datap + log_offset; - /* start_lsn is the first lsn written to. That's all we need. */ + /* Start_lsn is the first lsn written to. */ if (start_lsn && !*start_lsn) *start_lsn = be64_to_cpu(iclog->ic_header.h_lsn); + /* + * iclogs containing commit records or unmount records need + * to issue ordering cache flushes and commit immediately + * to stable storage to guarantee journal vs metadata ordering + * is correctly maintained in the storage media. + */ + if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) { + iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | + XLOG_ICL_NEED_FUA); + } + /* * This loop writes out as many regions as can fit in the amount * of space which was allocated by xlog_state_get_iclog_space(). diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index c04d5d37a3a2..263c8d907221 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -896,11 +896,16 @@ xlog_cil_push_work( /* * If the checkpoint spans multiple iclogs, wait for all previous - * iclogs to complete before we submit the commit_iclog. + * iclogs to complete before we submit the commit_iclog. If it is in the + * same iclog as the start of the checkpoint, then we can skip the iclog + * cache flush because there are no other iclogs we need to order + * against. */ if (ctx->start_lsn != commit_lsn) { spin_lock(&log->l_icloglock); xlog_wait_on_iclog(commit_iclog->ic_prev); + } else { + commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH; } /* release the hounds! */ diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 56e1942c47df..0552e96d2b64 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -133,6 +133,9 @@ enum xlog_iclog_state { #define XLOG_COVER_OPS 5 +#define XLOG_ICL_NEED_FLUSH (1 << 0) /* iclog needs REQ_PREFLUSH */ +#define XLOG_ICL_NEED_FUA (1 << 1) /* iclog needs REQ_FUA */ + /* Ticket reservation region accounting */ #define XLOG_TIC_LEN_MAX 15 @@ -201,6 +204,7 @@ typedef struct xlog_in_core { u32 ic_size; u32 ic_offset; enum xlog_iclog_state ic_state; + unsigned int ic_flags; char *ic_datap; /* pointer to iclog data */ /* Callback structures need their own cacheline */ From patchwork Fri Mar 5 05:11:07 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117779 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7B792C433E0 for ; Fri, 5 Mar 2021 05:29:44 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4D9EC65005 for ; Fri, 5 Mar 2021 05:29:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229446AbhCEF3n (ORCPT ); Fri, 5 Mar 2021 00:29:43 -0500 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:38680 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229448AbhCEF3m (ORCPT ); Fri, 5 Mar 2021 00:29:42 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 5FD84ECB0AE for ; Fri, 5 Mar 2021 16:29:41 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbo9-Ad for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZB-2r for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 09/45] xfs: Fix CIL throttle hang when CIL space used going backwards Date: Fri, 5 Mar 2021 16:11:07 +1100 Message-Id: <20210305051143.182133-10-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=pGLkceISAAAA:8 a=VwQbUJbxAAAA:8 a=jMqSS0pQBfBzRtNEyY0A:9 a=AjGcO6oz07-iQ99wixmX:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner A hang with tasks stuck on the CIL hard throttle was reported and largely diagnosed by Donald Buczek, who discovered that it was a result of the CIL context space usage decrementing in committed transactions once the hard throttle limit had been hit and processes were already blocked. This resulted in the CIL push not waking up those waiters because the CIL context was no longer over the hard throttle limit. The surprising aspect of this was the CIL space usage going backwards regularly enough to trigger this situation. Assumptions had been made in design that the relogging process would only increase the size of the objects in the CIL, and so that space would only increase. This change and commit message fixes the issue and documents the result of an audit of the triggers that can cause the CIL space to go backwards, how large the backwards steps tend to be, the frequency in which they occur, and what the impact on the CIL accounting code is. Even though the CIL ctx->space_used can go backwards, it will only do so if the log item is already logged to the CIL and contains a space reservation for it's entire logged state. This is tracked by the shadow buffer state on the log item. If the item is not previously logged in the CIL it has no shadow buffer nor log vector, and hence the entire size of the logged item copied to the log vector is accounted to the CIL space usage. i.e. it will always go up in this case. If the item has a log vector (i.e. already in the CIL) and the size decreases, then the existing log vector will be overwritten and the space usage will go down. This is the only condition where the space usage reduces, and it can only occur when an item is already tracked in the CIL. Hence we are safe from CIL space usage underruns as a result of log items decreasing in size when they are relogged. Typically this reduction in CIL usage occurs from metadata blocks being free, such as when a btree block merge occurs or a directory enter/xattr entry is removed and the da-tree is reduced in size. This generally results in a reduction in size of around a single block in the CIL, but also tends to increase the number of log vectors because the parent and sibling nodes in the tree needs to be updated when a btree block is removed. If a multi-level merge occurs, then we see reduction in size of 2+ blocks, but again the log vector count goes up. The other vector is inode fork size changes, which only log the current size of the fork and ignore the previously logged size when the fork is relogged. Hence if we are removing items from the inode fork (dir/xattr removal in shortform, extent record removal in extent form, etc) the relogged size of the inode for can decrease. No other log items can decrease in size either because they are a fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging an intent actually creates a new intent log item and doesn't relog the old item at all.) Hence the only two vectors for CIL context size reduction are relogging inode forks and marking buffers active in the CIL as stale. Long story short: the majority of the code does the right thing and handles the reduction in log item size correctly, and only the CIL hard throttle implementation is problematic and needs fixing. This patch makes that fix, as well as adds comments in the log item code that result in items shrinking in size when they are relogged as a clear reminder that this can and does happen frequently. The throttle fix is based upon the change Donald proposed, though it goes further to ensure that once the throttle is activated, it captures all tasks until the CIL push issues a wakeup, regardless of whether the CIL space used has gone back under the throttle threshold. This ensures that we prevent tasks reducing the CIL slightly under the throttle threshold and then making more changes that push it well over the throttle limit. This is acheived by checking if the throttle wait queue is already active as a condition of throttling. Hence once we start throttling, we continue to apply the throttle until the CIL context push wakes everything on the wait queue. We can use waitqueue_active() for the waitqueue manipulations and checks as they are all done under the ctx->xc_push_lock. Hence the waitqueue has external serialisation and we can safely peek inside the wait queue without holding the internal waitqueue locks. Many thanks to Donald for his diagnostic and analysis work to isolate the cause of this hang. Reported-and-tested-by: Donald Buczek Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Chandan Babu R Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_buf_item.c | 37 ++++++++++++++++++------------------- fs/xfs/xfs_inode_item.c | 14 ++++++++++++++ fs/xfs/xfs_log_cil.c | 22 +++++++++++++++++----- 3 files changed, 49 insertions(+), 24 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index dc0be2a639cc..17960b1ce5ef 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -56,14 +56,12 @@ xfs_buf_log_format_size( } /* - * This returns the number of log iovecs needed to log the - * given buf log item. + * Return the number of log iovecs and space needed to log the given buf log + * item segment. * - * It calculates this as 1 iovec for the buf log format structure - * and 1 for each stretch of non-contiguous chunks to be logged. - * Contiguous chunks are logged in a single iovec. - * - * If the XFS_BLI_STALE flag has been set, then log nothing. + * It calculates this as 1 iovec for the buf log format structure and 1 for each + * stretch of non-contiguous chunks to be logged. Contiguous chunks are logged + * in a single iovec. */ STATIC void xfs_buf_item_size_segment( @@ -119,11 +117,8 @@ xfs_buf_item_size_segment( } /* - * This returns the number of log iovecs needed to log the given buf log item. - * - * It calculates this as 1 iovec for the buf log format structure and 1 for each - * stretch of non-contiguous chunks to be logged. Contiguous chunks are logged - * in a single iovec. + * Return the number of log iovecs and space needed to log the given buf log + * item. * * Discontiguous buffers need a format structure per region that is being * logged. This makes the changes in the buffer appear to log recovery as though @@ -133,7 +128,11 @@ xfs_buf_item_size_segment( * what ends up on disk. * * If the XFS_BLI_STALE flag has been set, then log nothing but the buf log - * format structures. + * format structures. If the item has previously been logged and has dirty + * regions, we do not relog them in stale buffers. This has the effect of + * reducing the size of the relogged item by the amount of dirty data tracked + * by the log item. This can result in the committing transaction reducing the + * amount of space being consumed by the CIL. */ STATIC void xfs_buf_item_size( @@ -147,9 +146,9 @@ xfs_buf_item_size( ASSERT(atomic_read(&bip->bli_refcount) > 0); if (bip->bli_flags & XFS_BLI_STALE) { /* - * The buffer is stale, so all we need to log - * is the buf log format structure with the - * cancel flag in it. + * The buffer is stale, so all we need to log is the buf log + * format structure with the cancel flag in it as we are never + * going to replay the changes tracked in the log item. */ trace_xfs_buf_item_size_stale(bip); ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL); @@ -164,9 +163,9 @@ xfs_buf_item_size( if (bip->bli_flags & XFS_BLI_ORDERED) { /* - * The buffer has been logged just to order it. - * It is not being included in the transaction - * commit, so no vectors are used at all. + * The buffer has been logged just to order it. It is not being + * included in the transaction commit, so no vectors are used at + * all. */ trace_xfs_buf_item_size_ordered(bip); *nvecs = XFS_LOG_VEC_ORDERED; diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 17e20a6d8b4e..6ff91e5bf3cd 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -28,6 +28,20 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip) return container_of(lip, struct xfs_inode_log_item, ili_item); } +/* + * The logged size of an inode fork is always the current size of the inode + * fork. This means that when an inode fork is relogged, the size of the logged + * region is determined by the current state, not the combination of the + * previously logged state + the current state. This is different relogging + * behaviour to most other log items which will retain the size of the + * previously logged changes when smaller regions are relogged. + * + * Hence operations that remove data from the inode fork (e.g. shortform + * dir/attr remove, extent form extent removal, etc), the size of the relogged + * inode gets -smaller- rather than stays the same size as the previously logged + * size and this can result in the committing transaction reducing the amount of + * space being consumed by the CIL. + */ STATIC void xfs_inode_item_data_fork_size( struct xfs_inode_log_item *iip, diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 263c8d907221..2f0adc35d8ec 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -670,9 +670,14 @@ xlog_cil_push_work( ASSERT(push_seq <= ctx->sequence); /* - * Wake up any background push waiters now this context is being pushed. + * As we are about to switch to a new, empty CIL context, we no longer + * need to throttle tasks on CIL space overruns. Wake any waiters that + * the hard push throttle may have caught so they can start committing + * to the new context. The ctx->xc_push_lock provides the serialisation + * necessary for safely using the lockless waitqueue_active() check in + * this context. */ - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) + if (waitqueue_active(&cil->xc_push_wait)) wake_up_all(&cil->xc_push_wait); /* @@ -945,7 +950,7 @@ xlog_cil_push_background( ASSERT(!list_empty(&cil->xc_cil)); /* - * don't do a background push if we haven't used up all the + * Don't do a background push if we haven't used up all the * space available yet. */ if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) { @@ -969,9 +974,16 @@ xlog_cil_push_background( /* * If we are well over the space limit, throttle the work that is being - * done until the push work on this context has begun. + * done until the push work on this context has begun. Enforce the hard + * throttle on all transaction commits once it has been activated, even + * if the committing transactions have resulted in the space usage + * dipping back down under the hard limit. + * + * The ctx->xc_push_lock provides the serialisation necessary for safely + * using the lockless waitqueue_active() check in this context. */ - if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { + if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) || + waitqueue_active(&cil->xc_push_wait)) { trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket); ASSERT(cil->xc_ctx->space_used < log->l_logsize); xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock); From patchwork Fri Mar 5 05:11:08 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117679 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4799AC433DB for ; Fri, 5 Mar 2021 05:11:58 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 145B065017 for ; Fri, 5 Mar 2021 05:11:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229446AbhCEFLz (ORCPT ); Fri, 5 Mar 2021 00:11:55 -0500 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:40392 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229497AbhCEFLw (ORCPT ); Fri, 5 Mar 2021 00:11:52 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id DA4AF78AFD1 for ; Fri, 5 Mar 2021 16:11:50 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00FboC-CB for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZE-3y for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 10/45] xfs: reduce buffer log item shadow allocations Date: Fri, 5 Mar 2021 16:11:08 +1100 Message-Id: <20210305051143.182133-11-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=pGLkceISAAAA:8 a=VwQbUJbxAAAA:8 a=FNdSPaME_fANvIEDIncA:9 a=AjGcO6oz07-iQ99wixmX:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner When we modify btrees repeatedly, we regularly increase the size of the logged region by a single chunk at a time (per transaction commit). This results in the CIL formatting code having to reallocate the log vector buffer every time the buffer dirty region grows. Hence over a typical 4kB btree buffer, we might grow the log vector 4096/128 = 32x over a short period where we repeatedly add or remove records to/from the buffer over a series of running transaction. This means we are doing 32 memory allocations and frees over this time during a performance critical path in the journal. The amount of space tracked in the CIL for the object is calculated during the ->iop_format() call for the buffer log item, but the buffer memory allocated for it is calculated by the ->iop_size() call. The size callout determines the size of the buffer, the format call determines the space used in the buffer. Hence we can oversize the buffer space required in the size calculation without impacting the amount of space used and accounted to the CIL for the changes being logged. This allows us to reduce the number of allocations by rounding up the buffer size to allow for future growth. This can safe a substantial amount of CPU time in this path: - 46.52% 2.02% [kernel] [k] xfs_log_commit_cil - 44.49% xfs_log_commit_cil - 30.78% _raw_spin_lock - 30.75% do_raw_spin_lock 30.27% __pv_queued_spin_lock_slowpath (oh, ouch!) .... - 1.05% kmem_alloc_large - 1.02% kmem_alloc 0.94% __kmalloc This overhead here us what this patch is aimed at. After: - 0.76% kmem_alloc_large - 0.75% kmem_alloc 0.70% __kmalloc The size of 512 bytes is based on the bitmap chunk size being 128 bytes and that random directory entry updates almost never require more than 3-4 128 byte regions to be logged in the directory block. The other observation is for per-ag btrees. When we are inserting into a new btree block, we'll pack it from the front. Hence the first few records land in the first 128 bytes so we log only 128 bytes, the next 8-16 records land in the second region so now we log 256 bytes. And so on. If we are doing random updates, it will only allocate every 4 random 128 byte regions that are dirtied instead of every single one. Any larger than 512 bytes and I noticed an increase in memory footprint in my scalability workloads. Any less than this and I didn't really see any significant benefit to CPU usage. Signed-off-by: Dave Chinner Reviewed-by: Chandan Babu R Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_buf_item.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 17960b1ce5ef..0628a65d9c55 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -142,6 +142,7 @@ xfs_buf_item_size( { struct xfs_buf_log_item *bip = BUF_ITEM(lip); int i; + int bytes; ASSERT(atomic_read(&bip->bli_refcount) > 0); if (bip->bli_flags & XFS_BLI_STALE) { @@ -173,7 +174,7 @@ xfs_buf_item_size( } /* - * the vector count is based on the number of buffer vectors we have + * The vector count is based on the number of buffer vectors we have * dirty bits in. This will only be greater than one when we have a * compound buffer with more than one segment dirty. Hence for compound * buffers we need to track which segment the dirty bits correspond to, @@ -181,10 +182,18 @@ xfs_buf_item_size( * count for the extra buf log format structure that will need to be * written. */ + bytes = 0; for (i = 0; i < bip->bli_format_count; i++) { xfs_buf_item_size_segment(bip, &bip->bli_formats[i], - nvecs, nbytes); + nvecs, &bytes); } + + /* + * Round up the buffer size required to minimise the number of memory + * allocations that need to be done as this item grows when relogged by + * repeated modifications. + */ + *nbytes = round_up(bytes, 512); trace_xfs_buf_item_size(bip); } From patchwork Fri Mar 5 05:11:09 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117685 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 500FDC433E9 for ; Fri, 5 Mar 2021 05:12:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 33F4665004 for ; Fri, 5 Mar 2021 05:12:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229627AbhCEFL6 (ORCPT ); Fri, 5 Mar 2021 00:11:58 -0500 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:55507 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229516AbhCEFLx (ORCPT ); Fri, 5 Mar 2021 00:11:53 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 5C39E62FAE for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00FboF-DE for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZH-5R for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 11/45] xfs: xfs_buf_item_size_segment() needs to pass segment offset Date: Fri, 5 Mar 2021 16:11:09 +1100 Message-Id: <20210305051143.182133-12-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=VwQbUJbxAAAA:8 a=kme4I8nbBunu55xL2I8A:9 a=AjGcO6oz07-iQ99wixmX:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Otherwise it doesn't correctly calculate the number of vectors in a logged buffer that has a contiguous map that gets split into multiple regions because the range spans discontigous memory. Probably never been hit in practice - we don't log contiguous ranges on unmapped buffers (inode clusters). Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_buf_item.c | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 0628a65d9c55..91dc7d8c9739 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -55,6 +55,18 @@ xfs_buf_log_format_size( (blfp->blf_map_size * sizeof(blfp->blf_data_map[0])); } +static inline bool +xfs_buf_item_straddle( + struct xfs_buf *bp, + uint offset, + int next_bit, + int last_bit) +{ + return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) != + (xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) + + XFS_BLF_CHUNK); +} + /* * Return the number of log iovecs and space needed to log the given buf log * item segment. @@ -67,6 +79,7 @@ STATIC void xfs_buf_item_size_segment( struct xfs_buf_log_item *bip, struct xfs_buf_log_format *blfp, + uint offset, int *nvecs, int *nbytes) { @@ -101,12 +114,8 @@ xfs_buf_item_size_segment( */ if (next_bit == -1) { break; - } else if (next_bit != last_bit + 1) { - last_bit = next_bit; - (*nvecs)++; - } else if (xfs_buf_offset(bp, next_bit * XFS_BLF_CHUNK) != - (xfs_buf_offset(bp, last_bit * XFS_BLF_CHUNK) + - XFS_BLF_CHUNK)) { + } else if (next_bit != last_bit + 1 || + xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) { last_bit = next_bit; (*nvecs)++; } else { @@ -141,8 +150,10 @@ xfs_buf_item_size( int *nbytes) { struct xfs_buf_log_item *bip = BUF_ITEM(lip); + struct xfs_buf *bp = bip->bli_buf; int i; int bytes; + uint offset = 0; ASSERT(atomic_read(&bip->bli_refcount) > 0); if (bip->bli_flags & XFS_BLI_STALE) { @@ -184,8 +195,9 @@ xfs_buf_item_size( */ bytes = 0; for (i = 0; i < bip->bli_format_count; i++) { - xfs_buf_item_size_segment(bip, &bip->bli_formats[i], + xfs_buf_item_size_segment(bip, &bip->bli_formats[i], offset, nvecs, &bytes); + offset += BBTOB(bp->b_maps[i].bm_len); } /* @@ -212,18 +224,6 @@ xfs_buf_item_copy_iovec( nbits * XFS_BLF_CHUNK); } -static inline bool -xfs_buf_item_straddle( - struct xfs_buf *bp, - uint offset, - int next_bit, - int last_bit) -{ - return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) != - (xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) + - XFS_BLF_CHUNK); -} - static void xfs_buf_item_format_segment( struct xfs_buf_log_item *bip, From patchwork Fri Mar 5 05:11:10 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117693 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 24DDBC43381 for ; Fri, 5 Mar 2021 05:12:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id F2A5465013 for ; Fri, 5 Mar 2021 05:12:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229520AbhCEFMQ (ORCPT ); Fri, 5 Mar 2021 00:12:16 -0500 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:40384 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229528AbhCEFL6 (ORCPT ); Fri, 5 Mar 2021 00:11:58 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 1853E78A588 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00FboI-Em for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZK-6U for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 12/45] xfs: optimise xfs_buf_item_size/format for contiguous regions Date: Fri, 5 Mar 2021 16:11:10 +1100 Message-Id: <20210305051143.182133-13-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=VwQbUJbxAAAA:8 a=VGz6NFykJe5wMMP_S7kA:9 a=AjGcO6oz07-iQ99wixmX:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner We process the buf_log_item bitmap one set bit at a time with xfs_next_bit() so we can detect if a region crosses a memcpy discontinuity in the buffer data address. This has massive overhead on large buffers (e.g. 64k directory blocks) because we do a lot of unnecessary checks and xfs_buf_offset() calls. For example, 16-way concurrent create workload on debug kernel running CPU bound has this at the top of the profile at ~120k create/s on 64kb directory block size: 20.66% [kernel] [k] xfs_dir3_leaf_check_int 7.10% [kernel] [k] memcpy 6.22% [kernel] [k] xfs_next_bit 3.55% [kernel] [k] xfs_buf_offset 3.53% [kernel] [k] xfs_buf_item_format 3.34% [kernel] [k] __pv_queued_spin_lock_slowpath 3.04% [kernel] [k] do_raw_spin_lock 2.84% [kernel] [k] xfs_buf_item_size_segment.isra.0 2.31% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 1.36% [kernel] [k] xfs_log_commit_cil (debug checks hurt large blocks) The only buffers with discontinuities in the data address are unmapped buffers, and they are only used for inode cluster buffers and only for logging unlinked pointers. IOWs, it is -rare- that we even need to detect a discontinuity in the buffer item formatting code. Optimise all this by using xfs_contig_bits() to find the size of the contiguous regions, then test for a discontiunity inside it. If we find one, do the slow "bit at a time" method we do now. If we don't, then just copy the entire contiguous range in one go. Profile now looks like: 25.26% [kernel] [k] xfs_dir3_leaf_check_int 9.25% [kernel] [k] memcpy 5.01% [kernel] [k] __pv_queued_spin_lock_slowpath 2.84% [kernel] [k] do_raw_spin_lock 2.22% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 1.88% [kernel] [k] xfs_buf_find 1.53% [kernel] [k] memmove 1.47% [kernel] [k] xfs_log_commit_cil .... 0.34% [kernel] [k] xfs_buf_item_format .... 0.21% [kernel] [k] xfs_buf_offset .... 0.16% [kernel] [k] xfs_contig_bits .... 0.13% [kernel] [k] xfs_buf_item_size_segment.isra.0 So the bit scanning over for the dirty region tracking for the buffer log items is basically gone. Debug overhead hurts even more now... Perf comparison dir block creates unlink size (kb) time rate time Original 4 4m08s 220k 5m13s Original 64 7m21s 115k 13m25s Patched 4 3m59s 230k 5m03s Patched 64 6m23s 143k 12m33s Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_buf_item.c | 102 +++++++++++++++++++++++++++++++++++------- 1 file changed, 87 insertions(+), 15 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 91dc7d8c9739..14d1fefcbf4c 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -59,12 +59,18 @@ static inline bool xfs_buf_item_straddle( struct xfs_buf *bp, uint offset, - int next_bit, - int last_bit) + int first_bit, + int nbits) { - return xfs_buf_offset(bp, offset + (next_bit << XFS_BLF_SHIFT)) != - (xfs_buf_offset(bp, offset + (last_bit << XFS_BLF_SHIFT)) + - XFS_BLF_CHUNK); + void *first, *last; + + first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT)); + last = xfs_buf_offset(bp, + offset + ((first_bit + nbits) << XFS_BLF_SHIFT)); + + if (last - first != nbits * XFS_BLF_CHUNK) + return true; + return false; } /* @@ -84,20 +90,51 @@ xfs_buf_item_size_segment( int *nbytes) { struct xfs_buf *bp = bip->bli_buf; + int first_bit; + int nbits; int next_bit; int last_bit; - last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0); - if (last_bit == -1) + first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0); + if (first_bit == -1) return; - /* - * initial count for a dirty buffer is 2 vectors - the format structure - * and the first dirty region. - */ - *nvecs += 2; - *nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK; + (*nvecs)++; + *nbytes += xfs_buf_log_format_size(blfp); + + do { + nbits = xfs_contig_bits(blfp->blf_data_map, + blfp->blf_map_size, first_bit); + ASSERT(nbits > 0); + + /* + * Straddling a page is rare because we don't log contiguous + * chunks of unmapped buffers anywhere. + */ + if (nbits > 1 && + xfs_buf_item_straddle(bp, offset, first_bit, nbits)) + goto slow_scan; + + (*nvecs)++; + *nbytes += nbits * XFS_BLF_CHUNK; + + /* + * This takes the bit number to start looking from and + * returns the next set bit from there. It returns -1 + * if there are no more bits set or the start bit is + * beyond the end of the bitmap. + */ + first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, + (uint)first_bit + nbits + 1); + } while (first_bit != -1); + return; + +slow_scan: + /* Count the first bit we jumped out of the above loop from */ + (*nvecs)++; + *nbytes += XFS_BLF_CHUNK; + last_bit = first_bit; while (last_bit != -1) { /* * This takes the bit number to start looking from and @@ -115,11 +152,14 @@ xfs_buf_item_size_segment( if (next_bit == -1) { break; } else if (next_bit != last_bit + 1 || - xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) { + xfs_buf_item_straddle(bp, offset, first_bit, nbits)) { last_bit = next_bit; + first_bit = next_bit; (*nvecs)++; + nbits = 1; } else { last_bit++; + nbits++; } *nbytes += XFS_BLF_CHUNK; } @@ -276,6 +316,38 @@ xfs_buf_item_format_segment( /* * Fill in an iovec for each set of contiguous chunks. */ + do { + ASSERT(first_bit >= 0); + nbits = xfs_contig_bits(blfp->blf_data_map, + blfp->blf_map_size, first_bit); + ASSERT(nbits > 0); + + /* + * Straddling a page is rare because we don't log contiguous + * chunks of unmapped buffers anywhere. + */ + if (nbits > 1 && + xfs_buf_item_straddle(bp, offset, first_bit, nbits)) + goto slow_scan; + + xfs_buf_item_copy_iovec(lv, vecp, bp, offset, + first_bit, nbits); + blfp->blf_size++; + + /* + * This takes the bit number to start looking from and + * returns the next set bit from there. It returns -1 + * if there are no more bits set or the start bit is + * beyond the end of the bitmap. + */ + first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, + (uint)first_bit + nbits + 1); + } while (first_bit != -1); + + return; + +slow_scan: + ASSERT(bp->b_addr == NULL); last_bit = first_bit; nbits = 1; for (;;) { @@ -300,7 +372,7 @@ xfs_buf_item_format_segment( blfp->blf_size++; break; } else if (next_bit != last_bit + 1 || - xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) { + xfs_buf_item_straddle(bp, offset, first_bit, nbits)) { xfs_buf_item_copy_iovec(lv, vecp, bp, offset, first_bit, nbits); blfp->blf_size++; From patchwork Fri Mar 5 05:11:11 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117703 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id F063BC43331 for ; Fri, 5 Mar 2021 05:12:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C7DAF65013 for ; Fri, 5 Mar 2021 05:12:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229569AbhCEFMS (ORCPT ); Fri, 5 Mar 2021 00:12:18 -0500 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:57733 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229582AbhCEFMB (ORCPT ); Fri, 5 Mar 2021 00:12:01 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 06EC9827940 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00FboL-Gm for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZN-8D for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 13/45] xfs: xfs_log_force_lsn isn't passed a LSN Date: Fri, 5 Mar 2021 16:11:11 +1100 Message-Id: <20210305051143.182133-14-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=r68RYM1CTraGxGQ2onMA:9 a=g7y-_ejfhsbQHTQN:21 a=BsvyF6HemSEn-6Jy:21 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner In doing an investigation into AIL push stalls, I was looking at the log force code to see if an async CIL push could be done instead. This lead me to xfs_log_force_lsn() and looking at how it works. xfs_log_force_lsn() is only called from inode synchronisation contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn value as the LSN to sync the log to. This gets passed to xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the journal, and then used by xfs_log_force_lsn() to flush the iclogs to the journal. The problem with is that ip->i_itemp->ili_last_lsn does not store a log sequence number. What it stores is passed to it from the ->iop_committing method, which is called by xfs_log_commit_cil(). The value this passes to the iop_committing method is the CIL context sequence number that the item was committed to. As it turns out, xlog_cil_force_lsn() converts the sequence to an actual commit LSN for the related context and returns that to xfs_log_force_lsn(). xfs_log_force_lsn() overwrites it's "lsn" variable that contained a sequence with an actual LSN and then uses that to sync the iclogs. This caused me some confusion for a while, even though I originally wrote all this code a decade ago. ->iop_committing is only used by a couple of log item types, and only inode items use the sequence number it is passed. Let's clean up the API, CIL structures and inode log item to call it a sequence number, and make it clear that the high level code is using CIL sequence numbers and not on-disk LSNs for integrity synchronisation purposes. Signed-off-by: Dave Chinner --- fs/xfs/libxfs/xfs_types.h | 1 + fs/xfs/xfs_buf_item.c | 2 +- fs/xfs/xfs_dquot_item.c | 2 +- fs/xfs/xfs_file.c | 14 +++++++------- fs/xfs/xfs_inode.c | 10 +++++----- fs/xfs/xfs_inode_item.c | 4 ++-- fs/xfs/xfs_inode_item.h | 2 +- fs/xfs/xfs_log.c | 27 ++++++++++++++------------- fs/xfs/xfs_log.h | 4 +--- fs/xfs/xfs_log_cil.c | 22 +++++++++------------- fs/xfs/xfs_log_priv.h | 15 +++++++-------- fs/xfs/xfs_trans.c | 6 +++--- fs/xfs/xfs_trans.h | 4 ++-- 13 files changed, 54 insertions(+), 59 deletions(-) diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h index 064bd6e8c922..0870ef6f933d 100644 --- a/fs/xfs/libxfs/xfs_types.h +++ b/fs/xfs/libxfs/xfs_types.h @@ -21,6 +21,7 @@ typedef int32_t xfs_suminfo_t; /* type of bitmap summary info */ typedef uint32_t xfs_rtword_t; /* word type for bitmap manipulations */ typedef int64_t xfs_lsn_t; /* log sequence number */ +typedef int64_t xfs_csn_t; /* CIL sequence number */ typedef uint32_t xfs_dablk_t; /* dir/attr block number (in file) */ typedef uint32_t xfs_dahash_t; /* dir/attr hash value */ diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 14d1fefcbf4c..1cb087b320b1 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -713,7 +713,7 @@ xfs_buf_item_release( STATIC void xfs_buf_item_committing( struct xfs_log_item *lip, - xfs_lsn_t commit_lsn) + xfs_csn_t seq) { return xfs_buf_item_release(lip); } diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c index 8c1fdf37ee8f..8ed47b739b6c 100644 --- a/fs/xfs/xfs_dquot_item.c +++ b/fs/xfs/xfs_dquot_item.c @@ -188,7 +188,7 @@ xfs_qm_dquot_logitem_release( STATIC void xfs_qm_dquot_logitem_committing( struct xfs_log_item *lip, - xfs_lsn_t commit_lsn) + xfs_csn_t seq) { return xfs_qm_dquot_logitem_release(lip); } diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 24c7f45fc4eb..ac3120dfe477 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -119,8 +119,8 @@ xfs_dir_fsync( return xfs_log_force_inode(ip); } -static xfs_lsn_t -xfs_fsync_lsn( +static xfs_csn_t +xfs_fsync_seq( struct xfs_inode *ip, bool datasync) { @@ -128,7 +128,7 @@ xfs_fsync_lsn( return 0; if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) return 0; - return ip->i_itemp->ili_last_lsn; + return ip->i_itemp->ili_commit_seq; } /* @@ -151,12 +151,12 @@ xfs_fsync_flush_log( int *log_flushed) { int error = 0; - xfs_lsn_t lsn; + xfs_csn_t seq; xfs_ilock(ip, XFS_ILOCK_SHARED); - lsn = xfs_fsync_lsn(ip, datasync); - if (lsn) { - error = xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, + seq = xfs_fsync_seq(ip, datasync); + if (seq) { + error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, log_flushed); spin_lock(&ip->i_itemp->ili_lock); diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index bed2beb169e4..1c2ef1f1859a 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2644,7 +2644,7 @@ xfs_iunpin( trace_xfs_inode_unpin_nowait(ip, _RET_IP_); /* Give the log a push to start the unpinning I/O */ - xfs_log_force_lsn(ip->i_mount, ip->i_itemp->ili_last_lsn, 0, NULL); + xfs_log_force_seq(ip->i_mount, ip->i_itemp->ili_commit_seq, 0, NULL); } @@ -3652,16 +3652,16 @@ int xfs_log_force_inode( struct xfs_inode *ip) { - xfs_lsn_t lsn = 0; + xfs_csn_t seq = 0; xfs_ilock(ip, XFS_ILOCK_SHARED); if (xfs_ipincount(ip)) - lsn = ip->i_itemp->ili_last_lsn; + seq = ip->i_itemp->ili_commit_seq; xfs_iunlock(ip, XFS_ILOCK_SHARED); - if (!lsn) + if (!seq) return 0; - return xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, NULL); + return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, NULL); } /* diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 6ff91e5bf3cd..3aba4559469f 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -617,9 +617,9 @@ xfs_inode_item_committed( STATIC void xfs_inode_item_committing( struct xfs_log_item *lip, - xfs_lsn_t commit_lsn) + xfs_csn_t seq) { - INODE_ITEM(lip)->ili_last_lsn = commit_lsn; + INODE_ITEM(lip)->ili_commit_seq = seq; return xfs_inode_item_release(lip); } diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h index 4b926e32831c..403b45ab9aa2 100644 --- a/fs/xfs/xfs_inode_item.h +++ b/fs/xfs/xfs_inode_item.h @@ -33,7 +33,7 @@ struct xfs_inode_log_item { unsigned int ili_fields; /* fields to be logged */ unsigned int ili_fsync_fields; /* logged since last fsync */ xfs_lsn_t ili_flush_lsn; /* lsn at last flush */ - xfs_lsn_t ili_last_lsn; /* lsn at last transaction */ + xfs_csn_t ili_commit_seq; /* last transaction commit */ }; static inline int xfs_inode_clean(struct xfs_inode *ip) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index ed44d67d7099..145db0f88060 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -3273,14 +3273,13 @@ xfs_log_force( } static int -__xfs_log_force_lsn( - struct xfs_mount *mp, +xlog_force_lsn( + struct xlog *log, xfs_lsn_t lsn, uint flags, int *log_flushed, bool already_slept) { - struct xlog *log = mp->m_log; struct xlog_in_core *iclog; spin_lock(&log->l_icloglock); @@ -3313,8 +3312,6 @@ __xfs_log_force_lsn( if (!already_slept && (iclog->ic_prev->ic_state == XLOG_STATE_WANT_SYNC || iclog->ic_prev->ic_state == XLOG_STATE_SYNCING)) { - XFS_STATS_INC(mp, xs_log_force_sleep); - xlog_wait(&iclog->ic_prev->ic_write_wait, &log->l_icloglock); return -EAGAIN; @@ -3352,25 +3349,29 @@ __xfs_log_force_lsn( * to disk, that thread will wake up all threads waiting on the queue. */ int -xfs_log_force_lsn( +xfs_log_force_seq( struct xfs_mount *mp, - xfs_lsn_t lsn, + xfs_csn_t seq, uint flags, int *log_flushed) { + struct xlog *log = mp->m_log; + xfs_lsn_t lsn; int ret; - ASSERT(lsn != 0); + ASSERT(seq != 0); XFS_STATS_INC(mp, xs_log_force); - trace_xfs_log_force(mp, lsn, _RET_IP_); + trace_xfs_log_force(mp, seq, _RET_IP_); - lsn = xlog_cil_force_lsn(mp->m_log, lsn); + lsn = xlog_cil_force_seq(log, seq); if (lsn == NULLCOMMITLSN) return 0; - ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, false); - if (ret == -EAGAIN) - ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, true); + ret = xlog_force_lsn(log, lsn, flags, log_flushed, false); + if (ret == -EAGAIN) { + XFS_STATS_INC(mp, xs_log_force_sleep); + ret = xlog_force_lsn(log, lsn, flags, log_flushed, true); + } return ret; } diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 044e02cb8921..ba96f4ad9576 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -106,7 +106,7 @@ struct xfs_item_ops; struct xfs_trans; int xfs_log_force(struct xfs_mount *mp, uint flags); -int xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags, +int xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags, int *log_forced); int xfs_log_mount(struct xfs_mount *mp, struct xfs_buftarg *log_target, @@ -132,8 +132,6 @@ bool xfs_log_writable(struct xfs_mount *mp); struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket); void xfs_log_ticket_put(struct xlog_ticket *ticket); -void xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp, - xfs_lsn_t *commit_lsn, bool regrant); void xlog_cil_process_committed(struct list_head *list); bool xfs_log_item_in_current_chkpt(struct xfs_log_item *lip); diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 2f0adc35d8ec..44bb7cc17541 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -794,7 +794,7 @@ xlog_cil_push_work( * that higher sequences will wait for us to write out a commit record * before they do. * - * xfs_log_force_lsn requires us to mirror the new sequence into the cil + * xfs_log_force_seq requires us to mirror the new sequence into the cil * structure atomically with the addition of this sequence to the * committing list. This also ensures that we can do unlocked checks * against the current sequence in log forces without risking @@ -1058,16 +1058,14 @@ xlog_cil_empty( * allowed again. */ void -xfs_log_commit_cil( - struct xfs_mount *mp, +xlog_cil_commit( + struct xlog *log, struct xfs_trans *tp, - xfs_lsn_t *commit_lsn, + xfs_csn_t *commit_seq, bool regrant) { - struct xlog *log = mp->m_log; struct xfs_cil *cil = log->l_cilp; struct xfs_log_item *lip, *next; - xfs_lsn_t xc_commit_lsn; /* * Do all necessary memory allocation before we lock the CIL. @@ -1081,10 +1079,6 @@ xfs_log_commit_cil( xlog_cil_insert_items(log, tp); - xc_commit_lsn = cil->xc_ctx->sequence; - if (commit_lsn) - *commit_lsn = xc_commit_lsn; - if (regrant && !XLOG_FORCED_SHUTDOWN(log)) xfs_log_ticket_regrant(log, tp->t_ticket); else @@ -1107,8 +1101,10 @@ xfs_log_commit_cil( list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) { xfs_trans_del_item(lip); if (lip->li_ops->iop_committing) - lip->li_ops->iop_committing(lip, xc_commit_lsn); + lip->li_ops->iop_committing(lip, cil->xc_ctx->sequence); } + if (commit_seq) + *commit_seq = cil->xc_ctx->sequence; /* xlog_cil_push_background() releases cil->xc_ctx_lock */ xlog_cil_push_background(log); @@ -1125,9 +1121,9 @@ xfs_log_commit_cil( * iclog flush is necessary following this call. */ xfs_lsn_t -xlog_cil_force_lsn( +xlog_cil_force_seq( struct xlog *log, - xfs_lsn_t sequence) + xfs_csn_t sequence) { struct xfs_cil *cil = log->l_cilp; struct xfs_cil_ctx *ctx; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 0552e96d2b64..31ce2ce21e27 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -234,7 +234,7 @@ struct xfs_cil; struct xfs_cil_ctx { struct xfs_cil *cil; - xfs_lsn_t sequence; /* chkpt sequence # */ + xfs_csn_t sequence; /* chkpt sequence # */ xfs_lsn_t start_lsn; /* first LSN of chkpt commit */ xfs_lsn_t commit_lsn; /* chkpt commit record lsn */ struct xlog_ticket *ticket; /* chkpt ticket */ @@ -272,10 +272,10 @@ struct xfs_cil { struct xfs_cil_ctx *xc_ctx; spinlock_t xc_push_lock ____cacheline_aligned_in_smp; - xfs_lsn_t xc_push_seq; + xfs_csn_t xc_push_seq; struct list_head xc_committing; wait_queue_head_t xc_commit_wait; - xfs_lsn_t xc_current_sequence; + xfs_csn_t xc_current_sequence; struct work_struct xc_push_work; wait_queue_head_t xc_push_wait; /* background push throttle */ } ____cacheline_aligned_in_smp; @@ -552,19 +552,18 @@ int xlog_cil_init(struct xlog *log); void xlog_cil_init_post_recovery(struct xlog *log); void xlog_cil_destroy(struct xlog *log); bool xlog_cil_empty(struct xlog *log); +void xlog_cil_commit(struct xlog *log, struct xfs_trans *tp, + xfs_csn_t *commit_seq, bool regrant); /* * CIL force routines */ -xfs_lsn_t -xlog_cil_force_lsn( - struct xlog *log, - xfs_lsn_t sequence); +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence); static inline void xlog_cil_force(struct xlog *log) { - xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence); + xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence); } /* diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index b22a09e9daee..21ac7c048380 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -851,7 +851,7 @@ __xfs_trans_commit( bool regrant) { struct xfs_mount *mp = tp->t_mountp; - xfs_lsn_t commit_lsn = -1; + xfs_csn_t commit_seq = 0; int error = 0; int sync = tp->t_flags & XFS_TRANS_SYNC; @@ -893,7 +893,7 @@ __xfs_trans_commit( xfs_trans_apply_sb_deltas(tp); xfs_trans_apply_dquot_deltas(tp); - xfs_log_commit_cil(mp, tp, &commit_lsn, regrant); + xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant); xfs_trans_free(tp); @@ -902,7 +902,7 @@ __xfs_trans_commit( * log out now and wait for it. */ if (sync) { - error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL); + error = xfs_log_force_seq(mp, commit_seq, XFS_LOG_SYNC, NULL); XFS_STATS_INC(mp, xs_trans_sync); } else { XFS_STATS_INC(mp, xs_trans_async); diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index 9dd745cf77c9..6276c7d251e6 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -43,7 +43,7 @@ struct xfs_log_item { struct list_head li_cil; /* CIL pointers */ struct xfs_log_vec *li_lv; /* active log vector */ struct xfs_log_vec *li_lv_shadow; /* standby vector */ - xfs_lsn_t li_seq; /* CIL commit seq */ + xfs_csn_t li_seq; /* CIL commit seq */ }; /* @@ -69,7 +69,7 @@ struct xfs_item_ops { void (*iop_pin)(struct xfs_log_item *); void (*iop_unpin)(struct xfs_log_item *, int remove); uint (*iop_push)(struct xfs_log_item *, struct list_head *); - void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn); + void (*iop_committing)(struct xfs_log_item *lip, xfs_csn_t seq); void (*iop_release)(struct xfs_log_item *); xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t); int (*iop_recover)(struct xfs_log_item *lip, From patchwork Fri Mar 5 05:11:12 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117713 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8DAD8C433DB for ; Fri, 5 Mar 2021 05:12:24 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6874965013 for ; Fri, 5 Mar 2021 05:12:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229576AbhCEFMW (ORCPT ); Fri, 5 Mar 2021 00:12:22 -0500 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:40566 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229575AbhCEFMA (ORCPT ); Fri, 5 Mar 2021 00:12:00 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 6432778A69A for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00FboO-Id for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZQ-AP for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 14/45] xfs: AIL needs asynchronous CIL forcing Date: Fri, 5 Mar 2021 16:11:12 +1100 Message-Id: <20210305051143.182133-15-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=rHe3o5wPiT0kfo2B3CAA:9 a=lMS4OlSjnuOQQqOL:21 a=Eri97a0yetO2MvTf:21 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner The AIL pushing is stalling on log forces when it comes across pinned items. This is happening on removal workloads where the AIL is dominated by stale items that are removed from AIL when the checkpoint that marks the items stale is committed to the journal. This results is relatively few items in the AIL, but those that are are often pinned as directories items are being removed from are still being logged. As a result, many push cycles through the CIL will first issue a blocking log force to unpin the items. This can take some time to complete, with tracing regularly showing push delays of half a second and sometimes up into the range of several seconds. Sequences like this aren't uncommon: .... 399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20 400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0 400.099623: xfsaild: first lsn 0x11002f3600 400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50 400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0 400.589349: xfsaild: first lsn 0x1100305000 400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50 400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0 400.950343: xfsaild: first lsn 0x1100317c00 400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20 401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0 401.142334: xfsaild: first lsn 0x110032e600 401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10 401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000 401.154328: xfsaild: first lsn 0x1100353000 401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20 401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 401.451526: xfsaild: first lsn 0x1100353000 401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50 401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 .... In each of these cases, every AIL pass saw 101 log items stuck on the AIL (pinned) with very few other items being found. Each pass, a log force was issued, and delay between last/first is the sleep time + the sync log force time. Some of these 101 items pinned the tail of the log. The tail of the log does slowly creep forward (first lsn), but the problem is that the log is actually out of reservation space because it's been running so many transactions that stale items that never reach the AIL but consume log space. Hence we have a largely empty AIL, with long term pins on items that pin the tail of the log that don't get pushed frequently enough to keep log space available. The problem is the hundreds of milliseconds that we block in the log force pushing the CIL out to disk. The AIL should not be stalled like this - it needs to run and flush items that are at the tail of the log with minimal latency. What we really need to do is trigger a log flush, but then not wait for it at all - we've already done our waiting for stuff to complete when we backed off prior to the log force being issued. Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we still do a blocking flush of the CIL and that is what is causing the issue. Hence we need a new interface for the CIL to trigger an immediate background push of the CIL to get it moving faster but not to wait on that to occur. While the CIL is pushing, the AIL can also be pushing. We already have an internal interface to do this - xlog_cil_push_now() - but we need a wrapper for it to be used externally. xlog_cil_force_seq() can easily be extended to do what we need as it already implements the synchronous CIL push via xlog_cil_push_now(). Add the necessary flags and "push current sequence" semantics to xlog_cil_force_seq() and convert the AIL pushing to use it. One of the complexities here is that the CIL push does not guarantee that the commit record for the CIL checkpoint is written to disk. The current log force ensures this by submitting the current ACTIVE iclog that the commit record was written to. We need the CIL to actually write this commit record to disk for an async push to ensure that the checkpoint actually makes it to disk and unpins the pinned items in the checkpoint on completion. Hence we need to pass down to the CIL push that we are doing an async flush so that it can switch out the commit_iclog if necessary to get written to disk when the commit iclog is finally released. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 59 ++++++++++++++++-------------------------- fs/xfs/xfs_log.h | 2 +- fs/xfs/xfs_log_cil.c | 58 +++++++++++++++++++++++++++++++++-------- fs/xfs/xfs_log_priv.h | 10 +++++-- fs/xfs/xfs_sysfs.c | 1 + fs/xfs/xfs_trace.c | 1 + fs/xfs/xfs_trans.c | 2 +- fs/xfs/xfs_trans_ail.c | 11 +++++--- 8 files changed, 90 insertions(+), 54 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 145db0f88060..f54d48f4584e 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -50,11 +50,6 @@ xlog_state_get_iclog_space( int *continued_write, int *logoffsetp); STATIC void -xlog_state_switch_iclogs( - struct xlog *log, - struct xlog_in_core *iclog, - int eventual_size); -STATIC void xlog_grant_push_ail( struct xlog *log, int need_bytes); @@ -511,7 +506,7 @@ __xlog_state_release_iclog( * Flush iclog to disk if this is the last reference to the given iclog and the * it is in the WANT_SYNC state. */ -static int +int xlog_state_release_iclog( struct xlog *log, struct xlog_in_core *iclog) @@ -531,23 +526,6 @@ xlog_state_release_iclog( return 0; } -void -xfs_log_release_iclog( - struct xlog_in_core *iclog) -{ - struct xlog *log = iclog->ic_log; - bool sync = false; - - if (atomic_dec_and_lock(&iclog->ic_refcnt, &log->l_icloglock)) { - if (iclog->ic_state != XLOG_STATE_IOERROR) - sync = __xlog_state_release_iclog(log, iclog); - spin_unlock(&log->l_icloglock); - } - - if (sync) - xlog_sync(log, iclog); -} - /* * Mount a log filesystem * @@ -3125,7 +3103,7 @@ xfs_log_ticket_ungrant( * This routine will mark the current iclog in the ring as WANT_SYNC and move * the current iclog pointer to the next iclog in the ring. */ -STATIC void +void xlog_state_switch_iclogs( struct xlog *log, struct xlog_in_core *iclog, @@ -3272,6 +3250,20 @@ xfs_log_force( return -EIO; } +/* + * Force the log to a specific LSN. + * + * If an iclog with that lsn can be found: + * If it is in the DIRTY state, just return. + * If it is in the ACTIVE state, move the in-core log into the WANT_SYNC + * state and go to sleep or return. + * If it is in any other state, go to sleep or return. + * + * Synchronous forces are implemented with a wait queue. All callers trying + * to force a given lsn to disk must wait on the queue attached to the + * specific in-core log. When given in-core log finally completes its write + * to disk, that thread will wake up all threads waiting on the queue. + */ static int xlog_force_lsn( struct xlog *log, @@ -3335,18 +3327,13 @@ xlog_force_lsn( } /* - * Force the in-core log to disk for a specific LSN. - * - * Find in-core log with lsn. - * If it is in the DIRTY state, just return. - * If it is in the ACTIVE state, move the in-core log into the WANT_SYNC - * state and go to sleep or return. - * If it is in any other state, go to sleep or return. + * Force the log to a specific checkpoint sequence. * - * Synchronous forces are implemented with a wait queue. All callers trying - * to force a given lsn to disk must wait on the queue attached to the - * specific in-core log. When given in-core log finally completes its write - * to disk, that thread will wake up all threads waiting on the queue. + * First force the CIL so that all the required changes have been flushed to the + * iclogs. If the CIL force completed it will return a commit LSN that indicates + * the iclog that needs to be flushed to stable storage. If the caller needs + * a synchronous log force, we will wait on the iclog with the LSN returned by + * xlog_cil_force_seq() to be completed. */ int xfs_log_force_seq( @@ -3363,7 +3350,7 @@ xfs_log_force_seq( XFS_STATS_INC(mp, xs_log_force); trace_xfs_log_force(mp, seq, _RET_IP_); - lsn = xlog_cil_force_seq(log, seq); + lsn = xlog_cil_force_seq(log, XFS_LOG_SYNC, seq); if (lsn == NULLCOMMITLSN) return 0; diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index ba96f4ad9576..1bd080ce3a95 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -104,6 +104,7 @@ struct xlog_ticket; struct xfs_log_item; struct xfs_item_ops; struct xfs_trans; +struct xlog; int xfs_log_force(struct xfs_mount *mp, uint flags); int xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags, @@ -117,7 +118,6 @@ void xfs_log_mount_cancel(struct xfs_mount *); xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp); xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp); void xfs_log_space_wake(struct xfs_mount *mp); -void xfs_log_release_iclog(struct xlog_in_core *iclog); int xfs_log_reserve(struct xfs_mount *mp, int length, int count, diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 44bb7cc17541..b101c25cc9a9 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -658,6 +658,7 @@ xlog_cil_push_work( xfs_lsn_t push_seq; struct bio bio; DECLARE_COMPLETION_ONSTACK(bdev_flush); + bool commit_iclog_sync = false; new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS); new_ctx->ticket = xlog_cil_ticket_alloc(log); @@ -668,6 +669,8 @@ xlog_cil_push_work( spin_lock(&cil->xc_push_lock); push_seq = cil->xc_push_seq; ASSERT(push_seq <= ctx->sequence); + commit_iclog_sync = cil->xc_push_async; + cil->xc_push_async = false; /* * As we are about to switch to a new, empty CIL context, we no longer @@ -914,7 +917,11 @@ xlog_cil_push_work( } /* release the hounds! */ - xfs_log_release_iclog(commit_iclog); + spin_lock(&log->l_icloglock); + if (commit_iclog_sync && commit_iclog->ic_state == XLOG_STATE_ACTIVE) + xlog_state_switch_iclogs(log, commit_iclog, 0); + xlog_state_release_iclog(log, commit_iclog); + spin_unlock(&log->l_icloglock); return; out_skip: @@ -997,13 +1004,26 @@ xlog_cil_push_background( /* * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence * number that is passed. When it returns, the work will be queued for - * @push_seq, but it won't be completed. The caller is expected to do any - * waiting for push_seq to complete if it is required. + * @push_seq, but it won't be completed. + * + * If the caller is performing a synchronous force, we will flush the workqueue + * to get previously queued work moving to minimise the wait time they will + * undergo waiting for all outstanding pushes to complete. The caller is + * expected to do the required waiting for push_seq to complete. + * + * If the caller is performing an async push, we need to ensure that the + * checkpoint is fully flushed out of the iclogs when we finish the push. If we + * don't do this, then the commit record may remain sitting in memory in an + * ACTIVE iclog. This then requires another full log force to push to disk, + * which defeats the purpose of having an async, non-blocking CIL force + * mechanism. Hence in this case we need to pass a flag to the push work to + * indicate it needs to flush the commit record itself. */ static void xlog_cil_push_now( struct xlog *log, - xfs_lsn_t push_seq) + xfs_lsn_t push_seq, + bool sync) { struct xfs_cil *cil = log->l_cilp; @@ -1013,7 +1033,8 @@ xlog_cil_push_now( ASSERT(push_seq && push_seq <= cil->xc_current_sequence); /* start on any pending background push to minimise wait time on it */ - flush_work(&cil->xc_push_work); + if (sync) + flush_work(&cil->xc_push_work); /* * If the CIL is empty or we've already pushed the sequence then @@ -1026,6 +1047,8 @@ xlog_cil_push_now( } cil->xc_push_seq = push_seq; + if (!sync) + cil->xc_push_async = true; queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); spin_unlock(&cil->xc_push_lock); } @@ -1113,16 +1136,22 @@ xlog_cil_commit( /* * Conditionally push the CIL based on the sequence passed in. * - * We only need to push if we haven't already pushed the sequence - * number given. Hence the only time we will trigger a push here is - * if the push sequence is the same as the current context. + * We only need to push if we haven't already pushed the sequence number given. + * Hence the only time we will trigger a push here is if the push sequence is + * the same as the current context. * - * We return the current commit lsn to allow the callers to determine if a - * iclog flush is necessary following this call. + * If the sequence is zero, push the current sequence. If XFS_LOG_SYNC is set in + * the flags wait for it to complete, otherwise jsut return NULLCOMMITLSN to + * indicate we didn't wait for a commit lsn. + * + * If we waited for the push to complete, then we return the current commit lsn + * to allow the callers to determine if a iclog flush is necessary following + * this call. */ xfs_lsn_t xlog_cil_force_seq( struct xlog *log, + uint32_t flags, xfs_csn_t sequence) { struct xfs_cil *cil = log->l_cilp; @@ -1131,13 +1160,19 @@ xlog_cil_force_seq( ASSERT(sequence <= cil->xc_current_sequence); + if (!sequence) + sequence = cil->xc_current_sequence; + trace_xfs_log_force(log->l_mp, sequence, _RET_IP_); + /* * check to see if we need to force out the current context. * xlog_cil_push() handles racing pushes for the same sequence, * so no need to deal with it here. */ restart: - xlog_cil_push_now(log, sequence); + xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC); + if (!(flags & XFS_LOG_SYNC)) + return commit_lsn; /* * See if we can find a previous sequence still committing. @@ -1161,6 +1196,7 @@ xlog_cil_force_seq( * It is still being pushed! Wait for the push to * complete, then start again from the beginning. */ + XFS_STATS_INC(log->l_mp, xs_log_force_sleep); xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock); goto restart; } diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 31ce2ce21e27..a4e46258b2aa 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -273,6 +273,7 @@ struct xfs_cil { spinlock_t xc_push_lock ____cacheline_aligned_in_smp; xfs_csn_t xc_push_seq; + bool xc_push_async; struct list_head xc_committing; wait_queue_head_t xc_commit_wait; xfs_csn_t xc_current_sequence; @@ -487,6 +488,10 @@ int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector, struct xlog_in_core **commit_iclog, uint optype); int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, struct xlog_in_core **iclog, xfs_lsn_t *lsn); +void xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog, + int eventual_size); +int xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog); + void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket); void xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket); @@ -558,12 +563,13 @@ void xlog_cil_commit(struct xlog *log, struct xfs_trans *tp, /* * CIL force routines */ -xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence); +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags, + xfs_csn_t sequence); static inline void xlog_cil_force(struct xlog *log) { - xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence); + xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence); } /* diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c index f1bc88f4367c..18dc5eca6c04 100644 --- a/fs/xfs/xfs_sysfs.c +++ b/fs/xfs/xfs_sysfs.c @@ -10,6 +10,7 @@ #include "xfs_log_format.h" #include "xfs_trans_resv.h" #include "xfs_sysfs.h" +#include "xfs_log.h" #include "xfs_log_priv.h" #include "xfs_mount.h" diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c index 9b8d703dc9fd..d111a994b7b6 100644 --- a/fs/xfs/xfs_trace.c +++ b/fs/xfs/xfs_trace.c @@ -20,6 +20,7 @@ #include "xfs_bmap.h" #include "xfs_attr.h" #include "xfs_trans.h" +#include "xfs_log.h" #include "xfs_log_priv.h" #include "xfs_buf_item.h" #include "xfs_quota.h" diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 21ac7c048380..52f3fdf1e0de 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -9,7 +9,6 @@ #include "xfs_shared.h" #include "xfs_format.h" #include "xfs_log_format.h" -#include "xfs_log_priv.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_extent_busy.h" @@ -17,6 +16,7 @@ #include "xfs_trans.h" #include "xfs_trans_priv.h" #include "xfs_log.h" +#include "xfs_log_priv.h" #include "xfs_trace.h" #include "xfs_error.h" #include "xfs_defer.h" diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index dbb69b4bf3ed..dfc0206c0d36 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -17,6 +17,7 @@ #include "xfs_errortag.h" #include "xfs_error.h" #include "xfs_log.h" +#include "xfs_log_priv.h" #ifdef DEBUG /* @@ -429,8 +430,12 @@ xfsaild_push( /* * If we encountered pinned items or did not finish writing out all - * buffers the last time we ran, force the log first and wait for it - * before pushing again. + * buffers the last time we ran, force a background CIL push to get the + * items unpinned in the near future. We do not wait on the CIL push as + * that could stall us for seconds if there is enough background IO + * load. Stalling for that long when the tail of the log is pinned and + * needs flushing will hard stop the transaction subsystem when log + * space runs out. */ if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 && (!list_empty_careful(&ailp->ail_buf_list) || @@ -438,7 +443,7 @@ xfsaild_push( ailp->ail_log_flush = 0; XFS_STATS_INC(mp, xs_push_ail_flush); - xfs_log_force(mp, XFS_LOG_SYNC); + xlog_cil_force_seq(mp->m_log, 0, 0); } spin_lock(&ailp->ail_lock); From patchwork Fri Mar 5 05:11:13 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117711 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 76AC3C433E0 for ; Fri, 5 Mar 2021 05:12:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4E12865004 for ; Fri, 5 Mar 2021 05:12:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229560AbhCEFMV (ORCPT ); Fri, 5 Mar 2021 00:12:21 -0500 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:36884 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229576AbhCEFMA (ORCPT ); Fri, 5 Mar 2021 00:12:00 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 658C4ECB218 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00FboQ-JQ for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZT-Bs for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 15/45] xfs: CIL work is serialised, not pipelined Date: Fri, 5 Mar 2021 16:11:13 +1100 Message-Id: <20210305051143.182133-16-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=2XaAgQyQIFTc9u6BXUUA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Because we use a single work structure attached to the CIL rather than the CIL context, we can only queue a single work item at a time. This results in the CIL being single threaded and limits performance when it becomes CPU bound. The design of the CIL is that it is pipelined and multiple commits can be running concurrently, but the way the work is currently implemented means that it is not pipelining as it was intended. The critical work to switch the CIL context can take a few milliseconds to run, but the rest of the CIL context flush can take hundreds of milliseconds to complete. The context switching is the serialisation point of the CIL, once the context has been switched the rest of the context push can run asynchrnously with all other context pushes. Hence we can move the work to the CIL context so that we can run multiple CIL pushes at the same time and spread the majority of the work out over multiple CPUs. We can keep the per-cpu CIL commit state on the CIL rather than the context, because the context is pinned to the CIL until the switch is done and we aggregate and drain the per-cpu state held on the CIL during the context switch. However, because we no longer serialise the CIL work, we can have effectively unlimited CIL pushes in progress. We don't want to do this - not only does it create contention on the iclogs and the state machine locks, we can run the log right out of space with outstanding pushes. Instead, limit the work concurrency to 4 concurrent works being processed at a time. THis is enough concurrency to remove the CIL from being a CPU bound bottleneck but not enough to create new contention points or unbound concurrency issues. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 80 +++++++++++++++++++++++-------------------- fs/xfs/xfs_log_priv.h | 2 +- fs/xfs/xfs_super.c | 2 +- 3 files changed, 44 insertions(+), 40 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index b101c25cc9a9..dfc9ef692a80 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -47,6 +47,34 @@ xlog_cil_ticket_alloc( return tic; } +/* + * Unavoidable forward declaration - xlog_cil_push_work() calls + * xlog_cil_ctx_alloc() itself. + */ +static void xlog_cil_push_work(struct work_struct *work); + +static struct xfs_cil_ctx * +xlog_cil_ctx_alloc(void) +{ + struct xfs_cil_ctx *ctx; + + ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS); + INIT_LIST_HEAD(&ctx->committing); + INIT_LIST_HEAD(&ctx->busy_extents); + INIT_WORK(&ctx->push_work, xlog_cil_push_work); + return ctx; +} + +static void +xlog_cil_ctx_switch( + struct xfs_cil *cil, + struct xfs_cil_ctx *ctx) +{ + ctx->sequence = ++cil->xc_current_sequence; + ctx->cil = cil; + cil->xc_ctx = ctx; +} + /* * After the first stage of log recovery is done, we know where the head and * tail of the log are. We need this log initialisation done before we can @@ -641,11 +669,11 @@ static void xlog_cil_push_work( struct work_struct *work) { - struct xfs_cil *cil = - container_of(work, struct xfs_cil, xc_push_work); + struct xfs_cil_ctx *ctx = + container_of(work, struct xfs_cil_ctx, push_work); + struct xfs_cil *cil = ctx->cil; struct xlog *log = cil->xc_log; struct xfs_log_vec *lv; - struct xfs_cil_ctx *ctx; struct xfs_cil_ctx *new_ctx; struct xlog_in_core *commit_iclog; struct xlog_ticket *tic; @@ -660,11 +688,10 @@ xlog_cil_push_work( DECLARE_COMPLETION_ONSTACK(bdev_flush); bool commit_iclog_sync = false; - new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS); + new_ctx = xlog_cil_ctx_alloc(); new_ctx->ticket = xlog_cil_ticket_alloc(log); down_write(&cil->xc_ctx_lock); - ctx = cil->xc_ctx; spin_lock(&cil->xc_push_lock); push_seq = cil->xc_push_seq; @@ -696,7 +723,7 @@ xlog_cil_push_work( /* check for a previously pushed sequence */ - if (push_seq < cil->xc_ctx->sequence) { + if (push_seq < ctx->sequence) { spin_unlock(&cil->xc_push_lock); goto out_skip; } @@ -767,19 +794,7 @@ xlog_cil_push_work( } /* - * initialise the new context and attach it to the CIL. Then attach - * the current context to the CIL committing list so it can be found - * during log forces to extract the commit lsn of the sequence that - * needs to be forced. - */ - INIT_LIST_HEAD(&new_ctx->committing); - INIT_LIST_HEAD(&new_ctx->busy_extents); - new_ctx->sequence = ctx->sequence + 1; - new_ctx->cil = cil; - cil->xc_ctx = new_ctx; - - /* - * The switch is now done, so we can drop the context lock and move out + * Switch the contexts so we can drop the context lock and move out * of a shared context. We can't just go straight to the commit record, * though - we need to synchronise with previous and future commits so * that the commit records are correctly ordered in the log to ensure @@ -804,7 +819,7 @@ xlog_cil_push_work( * deferencing a freed context pointer. */ spin_lock(&cil->xc_push_lock); - cil->xc_current_sequence = new_ctx->sequence; + xlog_cil_ctx_switch(cil, new_ctx); spin_unlock(&cil->xc_push_lock); up_write(&cil->xc_ctx_lock); @@ -968,7 +983,7 @@ xlog_cil_push_background( spin_lock(&cil->xc_push_lock); if (cil->xc_push_seq < cil->xc_current_sequence) { cil->xc_push_seq = cil->xc_current_sequence; - queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); + queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work); } /* @@ -1034,7 +1049,7 @@ xlog_cil_push_now( /* start on any pending background push to minimise wait time on it */ if (sync) - flush_work(&cil->xc_push_work); + flush_workqueue(log->l_mp->m_cil_workqueue); /* * If the CIL is empty or we've already pushed the sequence then @@ -1049,7 +1064,7 @@ xlog_cil_push_now( cil->xc_push_seq = push_seq; if (!sync) cil->xc_push_async = true; - queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); + queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work); spin_unlock(&cil->xc_push_lock); } @@ -1286,13 +1301,6 @@ xlog_cil_init( if (!cil) return -ENOMEM; - ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL); - if (!ctx) { - kmem_free(cil); - return -ENOMEM; - } - - INIT_WORK(&cil->xc_push_work, xlog_cil_push_work); INIT_LIST_HEAD(&cil->xc_cil); INIT_LIST_HEAD(&cil->xc_committing); spin_lock_init(&cil->xc_cil_lock); @@ -1300,16 +1308,12 @@ xlog_cil_init( init_waitqueue_head(&cil->xc_push_wait); init_rwsem(&cil->xc_ctx_lock); init_waitqueue_head(&cil->xc_commit_wait); - - INIT_LIST_HEAD(&ctx->committing); - INIT_LIST_HEAD(&ctx->busy_extents); - ctx->sequence = 1; - ctx->cil = cil; - cil->xc_ctx = ctx; - cil->xc_current_sequence = ctx->sequence; - cil->xc_log = log; log->l_cilp = cil; + + ctx = xlog_cil_ctx_alloc(); + xlog_cil_ctx_switch(cil, ctx); + return 0; } diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index a4e46258b2aa..bb5fa6b71114 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -245,6 +245,7 @@ struct xfs_cil_ctx { struct list_head iclog_entry; struct list_head committing; /* ctx committing list */ struct work_struct discard_endio_work; + struct work_struct push_work; }; /* @@ -277,7 +278,6 @@ struct xfs_cil { struct list_head xc_committing; wait_queue_head_t xc_commit_wait; xfs_csn_t xc_current_sequence; - struct work_struct xc_push_work; wait_queue_head_t xc_push_wait; /* background push throttle */ } ____cacheline_aligned_in_smp; diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index ca2cb0448b5e..962f03a541e7 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -502,7 +502,7 @@ xfs_init_mount_workqueues( mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s", XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND), - 0, mp->m_super->s_id); + 4, mp->m_super->s_id); if (!mp->m_cil_workqueue) goto out_destroy_unwritten; From patchwork Fri Mar 5 05:11:14 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117683 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B36D8C43381 for ; Fri, 5 Mar 2021 05:11:59 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 94C0765013 for ; Fri, 5 Mar 2021 05:11:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229616AbhCEFL6 (ORCPT ); Fri, 5 Mar 2021 00:11:58 -0500 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:58881 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229512AbhCEFLw (ORCPT ); Fri, 5 Mar 2021 00:11:52 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 2027B1041377 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00FboU-Kx for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZW-Cr for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 16/45] xfs: type verification is expensive Date: Fri, 5 Mar 2021 16:11:14 +1100 Message-Id: <20210305051143.182133-17-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=Tu+Yewfh c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=VwQbUJbxAAAA:8 a=NuCpFDova00jneZWLDgA:9 a=1u2EsRlddwQMu84T:21 a=xfdbkvumHq1WlvTV:21 a=AjGcO6oz07-iQ99wixmX:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner From a concurrent rm -rf workload: 41.04% [kernel] [k] xfs_dir3_leaf_check_int 9.85% [kernel] [k] __xfs_dir3_data_check 5.60% [kernel] [k] xfs_verify_ino 5.32% [kernel] [k] xfs_agino_range 4.21% [kernel] [k] memcpy 3.06% [kernel] [k] xfs_errortag_test 2.57% [kernel] [k] xfs_dir_ino_validate 1.66% [kernel] [k] xfs_dir2_data_get_ftype 1.17% [kernel] [k] do_raw_spin_lock 1.11% [kernel] [k] xfs_verify_dir_ino 0.84% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 0.83% [kernel] [k] xfs_buf_find 0.64% [kernel] [k] xfs_log_commit_cil THere's an awful lot of overhead in just range checking inode numbers in that, but each inode number check is not a lot of code. The total is a bit over 14.5% of the CPU time is spent validating inode numbers. The problem is that they deeply nested global scope functions so the overhead here is all in function call marshalling. text data bss dec hex filename 2077 0 0 2077 81d fs/xfs/libxfs/xfs_types.o.orig 2197 0 0 2197 895 fs/xfs/libxfs/xfs_types.o There's a small increase in binary size by inlining all the local nested calls in the verifier functions, but the same workload now profiles as: 40.69% [kernel] [k] xfs_dir3_leaf_check_int 10.52% [kernel] [k] __xfs_dir3_data_check 6.68% [kernel] [k] xfs_verify_dir_ino 4.22% [kernel] [k] xfs_errortag_test 4.15% [kernel] [k] memcpy 3.53% [kernel] [k] xfs_dir_ino_validate 1.87% [kernel] [k] xfs_dir2_data_get_ftype 1.37% [kernel] [k] do_raw_spin_lock 0.98% [kernel] [k] xfs_buf_find 0.94% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 0.73% [kernel] [k] xfs_log_commit_cil Now we only spend just over 10% of the time validing inode numbers for the same workload. Hence a few "inline" keyworks is good enough to reduce the validation overhead by 30%... Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/libxfs/xfs_types.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/fs/xfs/libxfs/xfs_types.c b/fs/xfs/libxfs/xfs_types.c index b254fbeaaa50..04801362e1a7 100644 --- a/fs/xfs/libxfs/xfs_types.c +++ b/fs/xfs/libxfs/xfs_types.c @@ -13,7 +13,7 @@ #include "xfs_mount.h" /* Find the size of the AG, in blocks. */ -xfs_agblock_t +inline xfs_agblock_t xfs_ag_block_count( struct xfs_mount *mp, xfs_agnumber_t agno) @@ -29,7 +29,7 @@ xfs_ag_block_count( * Verify that an AG block number pointer neither points outside the AG * nor points at static metadata. */ -bool +inline bool xfs_verify_agbno( struct xfs_mount *mp, xfs_agnumber_t agno, @@ -49,7 +49,7 @@ xfs_verify_agbno( * Verify that an FS block number pointer neither points outside the * filesystem nor points at static AG metadata. */ -bool +inline bool xfs_verify_fsbno( struct xfs_mount *mp, xfs_fsblock_t fsbno) @@ -85,7 +85,7 @@ xfs_verify_fsbext( } /* Calculate the first and last possible inode number in an AG. */ -void +inline void xfs_agino_range( struct xfs_mount *mp, xfs_agnumber_t agno, @@ -116,7 +116,7 @@ xfs_agino_range( * Verify that an AG inode number pointer neither points outside the AG * nor points at static metadata. */ -bool +inline bool xfs_verify_agino( struct xfs_mount *mp, xfs_agnumber_t agno, @@ -146,7 +146,7 @@ xfs_verify_agino_or_null( * Verify that an FS inode number pointer neither points outside the * filesystem nor points at static AG metadata. */ -bool +inline bool xfs_verify_ino( struct xfs_mount *mp, xfs_ino_t ino) @@ -162,7 +162,7 @@ xfs_verify_ino( } /* Is this an internal inode number? */ -bool +inline bool xfs_internal_inum( struct xfs_mount *mp, xfs_ino_t ino) @@ -190,7 +190,7 @@ xfs_verify_dir_ino( * Verify that an realtime block number pointer doesn't point off the * end of the realtime device. */ -bool +inline bool xfs_verify_rtbno( struct xfs_mount *mp, xfs_rtblock_t rtbno) @@ -215,7 +215,7 @@ xfs_verify_rtext( } /* Calculate the range of valid icount values. */ -void +inline void xfs_icount_range( struct xfs_mount *mp, unsigned long long *min, From patchwork Fri Mar 5 05:11:15 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117707 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3E8A5C433E9 for ; Fri, 5 Mar 2021 05:12:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 11C5C65013 for ; Fri, 5 Mar 2021 05:12:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229582AbhCEFMS (ORCPT ); Fri, 5 Mar 2021 00:12:18 -0500 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:59333 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229465AbhCEFMA (ORCPT ); Fri, 5 Mar 2021 00:12:00 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 651E3104107C for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00FboX-N0 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZZ-EL for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 17/45] xfs: No need for inode number error injection in __xfs_dir3_data_check Date: Fri, 5 Mar 2021 16:11:15 +1100 Message-Id: <20210305051143.182133-18-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=VwQbUJbxAAAA:8 a=_ii9FaXb6EwW4meCbCgA:9 a=AjGcO6oz07-iQ99wixmX:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner We call xfs_dir_ino_validate() for every dir entry in a directory when doing validity checking of the directory. It calls xfs_verify_dir_ino() then emits a corruption report if bad or does error injection if good. It is extremely costly: 43.27% [kernel] [k] xfs_dir3_leaf_check_int 10.28% [kernel] [k] __xfs_dir3_data_check 6.61% [kernel] [k] xfs_verify_dir_ino 4.16% [kernel] [k] xfs_errortag_test 4.00% [kernel] [k] memcpy 3.48% [kernel] [k] xfs_dir_ino_validate 7% of the cpu usage in this directory traversal workload is xfs_dir_ino_validate() doing absolutely nothing. We don't need error injection to simulate a bad inode numbers in the directory structure because we can do that by fuzzing the structure on disk. And we don't need a corruption report, because the __xfs_dir3_data_check() will emit one if the inode number is bad. So just call xfs_verify_dir_ino() directly here, and get rid of all this unnecessary overhead: 40.30% [kernel] [k] xfs_dir3_leaf_check_int 10.98% [kernel] [k] __xfs_dir3_data_check 8.10% [kernel] [k] xfs_verify_dir_ino 4.42% [kernel] [k] memcpy 2.22% [kernel] [k] xfs_dir2_data_get_ftype 1.52% [kernel] [k] do_raw_spin_lock Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- fs/xfs/libxfs/xfs_dir2_data.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c index 375b3edb2ad2..e67fa086f2c1 100644 --- a/fs/xfs/libxfs/xfs_dir2_data.c +++ b/fs/xfs/libxfs/xfs_dir2_data.c @@ -218,7 +218,7 @@ __xfs_dir3_data_check( */ if (dep->namelen == 0) return __this_address; - if (xfs_dir_ino_validate(mp, be64_to_cpu(dep->inumber))) + if (!xfs_verify_dir_ino(mp, be64_to_cpu(dep->inumber))) return __this_address; if (offset + xfs_dir2_data_entsize(mp, dep->namelen) > end) return __this_address; From patchwork Fri Mar 5 05:11:16 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117725 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7BF6DC433E9 for ; Fri, 5 Mar 2021 05:12:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5A40865016 for ; Fri, 5 Mar 2021 05:12:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229597AbhCEFM1 (ORCPT ); Fri, 5 Mar 2021 00:12:27 -0500 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:55931 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229608AbhCEFMD (ORCPT ); Fri, 5 Mar 2021 00:12:03 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 65D0762F7D for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00FboZ-Nn for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZc-G7 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 18/45] xfs: reduce debug overhead of dir leaf/node checks Date: Fri, 5 Mar 2021 16:11:16 +1100 Message-Id: <20210305051143.182133-19-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=VwQbUJbxAAAA:8 a=es5hYbETAauG9ijk2hQA:9 a=AjGcO6oz07-iQ99wixmX:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner On debug kernels, we call xfs_dir3_leaf_check_int() multiple times on every directory modification. The robust hash ordering checks it does on every entry in the leaf on every call results in a massive CPU overhead which slows down debug kernels by a large amount. We use xfs_dir3_leaf_check_int() for the verifiers as well, so we can't just gut the function to reduce overhead. What we can do, however, is reduce the work it does when it is called from the debug interfaces, just leaving the high level checks in place and leaving the robust validation to the verifiers. This means the debug checks will catch gross errors, but subtle bugs might not be caught until a verifier is run. It is easy enough to restore the existing debug behaviour if the developer needs it (just change a call parameter in the debug code), but overwise the overhead makes testing large directory block sizes on debug kernels very slow. Profile at an unlink rate of ~80k file/s on a 64k block size filesystem before the patch: 40.30% [kernel] [k] xfs_dir3_leaf_check_int 10.98% [kernel] [k] __xfs_dir3_data_check 8.10% [kernel] [k] xfs_verify_dir_ino 4.42% [kernel] [k] memcpy 2.22% [kernel] [k] xfs_dir2_data_get_ftype 1.52% [kernel] [k] do_raw_spin_lock Profile after, at an unlink rate of ~125k files/s (+50% improvement) has largely dropped the leaf verification debug overhead out of the profile. 16.53% [kernel] [k] __xfs_dir3_data_check 12.53% [kernel] [k] xfs_verify_dir_ino 7.97% [kernel] [k] memcpy 3.36% [kernel] [k] xfs_dir2_data_get_ftype 2.86% [kernel] [k] __pv_queued_spin_lock_slowpath Create shows a similar change in profile and a +25% improvement in performance. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- fs/xfs/libxfs/xfs_dir2_leaf.c | 10 +++++++--- fs/xfs/libxfs/xfs_dir2_node.c | 2 +- fs/xfs/libxfs/xfs_dir2_priv.h | 3 ++- 3 files changed, 10 insertions(+), 5 deletions(-) diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c index 95d2a3f92d75..ccd8d0aa62b8 100644 --- a/fs/xfs/libxfs/xfs_dir2_leaf.c +++ b/fs/xfs/libxfs/xfs_dir2_leaf.c @@ -113,7 +113,7 @@ xfs_dir3_leaf1_check( } else if (leafhdr.magic != XFS_DIR2_LEAF1_MAGIC) return __this_address; - return xfs_dir3_leaf_check_int(dp->i_mount, &leafhdr, leaf); + return xfs_dir3_leaf_check_int(dp->i_mount, &leafhdr, leaf, false); } static inline void @@ -139,7 +139,8 @@ xfs_failaddr_t xfs_dir3_leaf_check_int( struct xfs_mount *mp, struct xfs_dir3_icleaf_hdr *hdr, - struct xfs_dir2_leaf *leaf) + struct xfs_dir2_leaf *leaf, + bool expensive_checking) { struct xfs_da_geometry *geo = mp->m_dir_geo; xfs_dir2_leaf_tail_t *ltp; @@ -162,6 +163,9 @@ xfs_dir3_leaf_check_int( (char *)&hdr->ents[hdr->count] > (char *)xfs_dir2_leaf_bests_p(ltp)) return __this_address; + if (!expensive_checking) + return NULL; + /* Check hash value order, count stale entries. */ for (i = stale = 0; i < hdr->count; i++) { if (i + 1 < hdr->count) { @@ -195,7 +199,7 @@ xfs_dir3_leaf_verify( return fa; xfs_dir2_leaf_hdr_from_disk(mp, &leafhdr, bp->b_addr); - return xfs_dir3_leaf_check_int(mp, &leafhdr, bp->b_addr); + return xfs_dir3_leaf_check_int(mp, &leafhdr, bp->b_addr, true); } static void diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c index 5d51265d29d6..80a64117b460 100644 --- a/fs/xfs/libxfs/xfs_dir2_node.c +++ b/fs/xfs/libxfs/xfs_dir2_node.c @@ -73,7 +73,7 @@ xfs_dir3_leafn_check( } else if (leafhdr.magic != XFS_DIR2_LEAFN_MAGIC) return __this_address; - return xfs_dir3_leaf_check_int(dp->i_mount, &leafhdr, leaf); + return xfs_dir3_leaf_check_int(dp->i_mount, &leafhdr, leaf, false); } static inline void diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h index 44c6a77cba05..94943ce49cab 100644 --- a/fs/xfs/libxfs/xfs_dir2_priv.h +++ b/fs/xfs/libxfs/xfs_dir2_priv.h @@ -127,7 +127,8 @@ xfs_dir3_leaf_find_entry(struct xfs_dir3_icleaf_hdr *leafhdr, extern int xfs_dir2_node_to_leaf(struct xfs_da_state *state); extern xfs_failaddr_t xfs_dir3_leaf_check_int(struct xfs_mount *mp, - struct xfs_dir3_icleaf_hdr *hdr, struct xfs_dir2_leaf *leaf); + struct xfs_dir3_icleaf_hdr *hdr, struct xfs_dir2_leaf *leaf, + bool expensive_checks); /* xfs_dir2_node.c */ void xfs_dir2_free_hdr_from_disk(struct xfs_mount *mp, From patchwork Fri Mar 5 05:11:17 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117739 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0E666C4332B for ; Fri, 5 Mar 2021 05:12:55 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E37B665014 for ; Fri, 5 Mar 2021 05:12:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229592AbhCEFMy (ORCPT ); Fri, 5 Mar 2021 00:12:54 -0500 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:58877 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229670AbhCEFMP (ORCPT ); Fri, 5 Mar 2021 00:12:15 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 851E1827FEE for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbod-Ox for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZf-H6 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 19/45] xfs: factor out the CIL transaction header building Date: Fri, 5 Mar 2021 16:11:17 +1100 Message-Id: <20210305051143.182133-20-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=Tu+Yewfh c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=LUy3kMpZBJK1m2kfqSYA:9 a=t3JqoWTAWdghYSfI:21 a=LuF9V3Dyg09g15CV:21 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner It is static code deep in the middle of the CIL push logic. Factor it out into a helper so that it is clear and easy to modify separately. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_log_cil.c | 71 +++++++++++++++++++++++++++++--------------- 1 file changed, 47 insertions(+), 24 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index dfc9ef692a80..b515002e7959 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -651,6 +651,41 @@ xlog_cil_process_committed( } } +struct xlog_cil_trans_hdr { + struct xfs_trans_header thdr; + struct xfs_log_iovec lhdr; +}; + +/* + * Build a checkpoint transaction header to begin the journal transaction. We + * need to account for the space used by the transaction header here as it is + * not accounted for in xlog_write(). + */ +static void +xlog_cil_build_trans_hdr( + struct xfs_cil_ctx *ctx, + struct xlog_cil_trans_hdr *hdr, + struct xfs_log_vec *lvhdr, + int num_iovecs) +{ + struct xlog_ticket *tic = ctx->ticket; + + memset(hdr, 0, sizeof(*hdr)); + + hdr->thdr.th_magic = XFS_TRANS_HEADER_MAGIC; + hdr->thdr.th_type = XFS_TRANS_CHECKPOINT; + hdr->thdr.th_tid = tic->t_tid; + hdr->thdr.th_num_items = num_iovecs; + hdr->lhdr.i_addr = &hdr->thdr; + hdr->lhdr.i_len = sizeof(xfs_trans_header_t); + hdr->lhdr.i_type = XLOG_REG_TYPE_TRANSHDR; + tic->t_curr_res -= hdr->lhdr.i_len + sizeof(xlog_op_header_t); + + lvhdr->lv_niovecs = 1; + lvhdr->lv_iovecp = &hdr->lhdr; + lvhdr->lv_next = ctx->lv_chain; +} + /* * Push the Committed Item List to the log. * @@ -676,11 +711,9 @@ xlog_cil_push_work( struct xfs_log_vec *lv; struct xfs_cil_ctx *new_ctx; struct xlog_in_core *commit_iclog; - struct xlog_ticket *tic; int num_iovecs; int error = 0; - struct xfs_trans_header thdr; - struct xfs_log_iovec lhdr; + struct xlog_cil_trans_hdr thdr; struct xfs_log_vec lvhdr = { NULL }; xfs_lsn_t commit_lsn; xfs_lsn_t push_seq; @@ -827,24 +860,8 @@ xlog_cil_push_work( * Build a checkpoint transaction header and write it to the log to * begin the transaction. We need to account for the space used by the * transaction header here as it is not accounted for in xlog_write(). - * - * The LSN we need to pass to the log items on transaction commit is - * the LSN reported by the first log vector write. If we use the commit - * record lsn then we can move the tail beyond the grant write head. */ - tic = ctx->ticket; - thdr.th_magic = XFS_TRANS_HEADER_MAGIC; - thdr.th_type = XFS_TRANS_CHECKPOINT; - thdr.th_tid = tic->t_tid; - thdr.th_num_items = num_iovecs; - lhdr.i_addr = &thdr; - lhdr.i_len = sizeof(xfs_trans_header_t); - lhdr.i_type = XLOG_REG_TYPE_TRANSHDR; - tic->t_curr_res -= lhdr.i_len + sizeof(xlog_op_header_t); - - lvhdr.lv_niovecs = 1; - lvhdr.lv_iovecp = &lhdr; - lvhdr.lv_next = ctx->lv_chain; + xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs); /* * Before we format and submit the first iclog, we have to ensure that @@ -852,7 +869,13 @@ xlog_cil_push_work( */ wait_for_completion(&bdev_flush); - error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, + /* + * The LSN we need to pass to the log items on transaction commit is the + * LSN reported by the first log vector write, not the commit lsn. If we + * use the commit record lsn then we can move the tail beyond the grant + * write head. + */ + error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL, XLOG_START_TRANS); if (error) goto out_abort_free_ticket; @@ -891,11 +914,11 @@ xlog_cil_push_work( } spin_unlock(&cil->xc_push_lock); - error = xlog_commit_record(log, tic, &commit_iclog, &commit_lsn); + error = xlog_commit_record(log, ctx->ticket, &commit_iclog, &commit_lsn); if (error) goto out_abort_free_ticket; - xfs_log_ticket_ungrant(log, tic); + xfs_log_ticket_ungrant(log, ctx->ticket); spin_lock(&commit_iclog->ic_callback_lock); if (commit_iclog->ic_state == XLOG_STATE_IOERROR) { @@ -946,7 +969,7 @@ xlog_cil_push_work( return; out_abort_free_ticket: - xfs_log_ticket_ungrant(log, tic); + xfs_log_ticket_ungrant(log, ctx->ticket); out_abort: ASSERT(XLOG_FORCED_SHUTDOWN(log)); xlog_cil_committed(ctx); From patchwork Fri Mar 5 05:11:18 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117731 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id ABF1EC433E0 for ; Fri, 5 Mar 2021 05:12:53 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 88AFC65013 for ; Fri, 5 Mar 2021 05:12:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229604AbhCEFM2 (ORCPT ); Fri, 5 Mar 2021 00:12:28 -0500 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:55933 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229602AbhCEFMD (ORCPT ); Fri, 5 Mar 2021 00:12:03 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 65D7B62F81 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbog-Qc for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZi-IQ for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 20/45] xfs: only CIL pushes require a start record Date: Fri, 5 Mar 2021 16:11:18 +1100 Message-Id: <20210305051143.182133-21-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=KPKL3CcRt_6iA-bUn9MA:9 a=Echpx2VvfMwB_eDK:21 a=YDqPLSn3tuCoHY7m:21 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner So move the one-off start record writing in xlog_write() out into the static header that the CIL push builds to write into the log initially. This simplifes the xlog_write() logic a lot. pahole on x86-64 confirms that the xlog_cil_trans_hdr is correctly 32 bit aligned and packed for copying the log op and transaction headers directly into the log as a single log region copy. struct xlog_cil_trans_hdr { struct xlog_op_header oph[2]; /* 0 24 */ struct xfs_trans_header thdr; /* 24 16 */ struct xfs_log_iovec lhdr; /* 40 16 */ /* size: 56, cachelines: 1, members: 3 */ /* last cacheline: 56 bytes */ }; A wart is needed to handle the fact that length of the region the opheader points to doesn't include the opheader length. hence if we embed the opheader, we have to substract the opheader length from the length written into the opheader by the generic copying code. This will eventually go away when everything is converted to embedded opheaders. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_log.c | 90 ++++++++++++++++++++++---------------------- fs/xfs/xfs_log_cil.c | 44 ++++++++++++++++++---- 2 files changed, 81 insertions(+), 53 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index f54d48f4584e..b2f9fb1b4fed 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -2106,9 +2106,9 @@ xlog_print_trans( } /* - * Calculate the potential space needed by the log vector. We may need a start - * record, and each region gets its own struct xlog_op_header and may need to be - * double word aligned. + * Calculate the potential space needed by the log vector. If this is a start + * transaction, the caller has already accounted for both opheaders in the start + * transaction, so we don't need to account for them here. */ static int xlog_write_calc_vec_length( @@ -2121,9 +2121,6 @@ xlog_write_calc_vec_length( int len = 0; int i; - if (optype & XLOG_START_TRANS) - headers++; - for (lv = log_vector; lv; lv = lv->lv_next) { /* we don't write ordered log vectors */ if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) @@ -2139,24 +2136,20 @@ xlog_write_calc_vec_length( } } + /* Don't account for regions with embedded ophdrs */ + if (optype && headers > 0) { + if (optype & XLOG_START_TRANS) { + ASSERT(headers >= 2); + headers -= 2; + } + } + ticket->t_res_num_ophdrs += headers; len += headers * sizeof(struct xlog_op_header); return len; } -static void -xlog_write_start_rec( - struct xlog_op_header *ophdr, - struct xlog_ticket *ticket) -{ - ophdr->oh_tid = cpu_to_be32(ticket->t_tid); - ophdr->oh_clientid = ticket->t_clientid; - ophdr->oh_len = 0; - ophdr->oh_flags = XLOG_START_TRANS; - ophdr->oh_res2 = 0; -} - static xlog_op_header_t * xlog_write_setup_ophdr( struct xlog *log, @@ -2361,9 +2354,11 @@ xlog_write( * If this is a commit or unmount transaction, we don't need a start * record to be written. We do, however, have to account for the * commit or unmount header that gets written. Hence we always have - * to account for an extra xlog_op_header here. + * to account for an extra xlog_op_header here for commit and unmount + * records. */ - ticket->t_curr_res -= sizeof(struct xlog_op_header); + if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) + ticket->t_curr_res -= sizeof(struct xlog_op_header); if (ticket->t_curr_res < 0) { xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, "ctx ticket reservation ran out. Need to up reservation"); @@ -2411,7 +2406,7 @@ xlog_write( int copy_len; int copy_off; bool ordered = false; - bool wrote_start_rec = false; + bool added_ophdr = false; /* ordered log vectors have no regions to write */ if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) { @@ -2425,25 +2420,24 @@ xlog_write( ASSERT((unsigned long)ptr % sizeof(int32_t) == 0); /* - * Before we start formatting log vectors, we need to - * write a start record. Only do this for the first - * iclog we write to. + * The XLOG_START_TRANS has embedded ophdrs for the + * start record and transaction header. They will always + * be the first two regions in the lv chain. */ if (optype & XLOG_START_TRANS) { - xlog_write_start_rec(ptr, ticket); - xlog_write_adv_cnt(&ptr, &len, &log_offset, - sizeof(struct xlog_op_header)); - optype &= ~XLOG_START_TRANS; - wrote_start_rec = true; - } - - ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype); - if (!ophdr) - return -EIO; + ophdr = reg->i_addr; + if (index) + optype &= ~XLOG_START_TRANS; + } else { + ophdr = xlog_write_setup_ophdr(log, ptr, + ticket, optype); + if (!ophdr) + return -EIO; - xlog_write_adv_cnt(&ptr, &len, &log_offset, + xlog_write_adv_cnt(&ptr, &len, &log_offset, sizeof(struct xlog_op_header)); - + added_ophdr = true; + } len += xlog_write_setup_copy(ticket, ophdr, iclog->ic_size-log_offset, reg->i_len, @@ -2452,13 +2446,22 @@ xlog_write( &partial_copy_len); xlog_verify_dest_ptr(log, ptr); + + /* + * Wart: need to update length in embedded ophdr not + * to include it's own length. + */ + if (!added_ophdr) { + ophdr->oh_len = cpu_to_be32(copy_len - + sizeof(struct xlog_op_header)); + } /* * Copy region. * - * Unmount records just log an opheader, so can have - * empty payloads with no data region to copy. Hence we - * only copy the payload if the vector says it has data - * to copy. + * Commit and unmount records just log an opheader, so + * we can have empty payloads with no data region to + * copy. Hence we only copy the payload if the vector + * says it has data to copy. */ ASSERT(copy_len >= 0); if (copy_len > 0) { @@ -2466,12 +2469,9 @@ xlog_write( xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len); } - copy_len += sizeof(struct xlog_op_header); - record_cnt++; - if (wrote_start_rec) { + if (added_ophdr) copy_len += sizeof(struct xlog_op_header); - record_cnt++; - } + record_cnt++; data_cnt += contwr ? copy_len : 0; error = xlog_write_copy_finish(log, iclog, optype, diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index b515002e7959..e9da074ecd69 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -652,14 +652,22 @@ xlog_cil_process_committed( } struct xlog_cil_trans_hdr { + struct xlog_op_header oph[2]; struct xfs_trans_header thdr; - struct xfs_log_iovec lhdr; + struct xfs_log_iovec lhdr[2]; }; /* * Build a checkpoint transaction header to begin the journal transaction. We * need to account for the space used by the transaction header here as it is * not accounted for in xlog_write(). + * + * This is the only place we write a transaction header, so we also build the + * log opheaders that indicate the start of a log transaction and wrap the + * transaction header. We keep the start record in it's own log vector rather + * than compacting them into a single region as this ends up making the logic + * in xlog_write() for handling empty opheaders for start, commit and unmount + * records much simpler. */ static void xlog_cil_build_trans_hdr( @@ -669,20 +677,40 @@ xlog_cil_build_trans_hdr( int num_iovecs) { struct xlog_ticket *tic = ctx->ticket; + uint32_t tid = cpu_to_be32(tic->t_tid); memset(hdr, 0, sizeof(*hdr)); + /* Log start record */ + hdr->oph[0].oh_tid = tid; + hdr->oph[0].oh_clientid = XFS_TRANSACTION; + hdr->oph[0].oh_flags = XLOG_START_TRANS; + + /* log iovec region pointer */ + hdr->lhdr[0].i_addr = &hdr->oph[0]; + hdr->lhdr[0].i_len = sizeof(struct xlog_op_header); + hdr->lhdr[0].i_type = XLOG_REG_TYPE_LRHEADER; + + /* log opheader */ + hdr->oph[1].oh_tid = tid; + hdr->oph[1].oh_clientid = XFS_TRANSACTION; + + /* transaction header */ hdr->thdr.th_magic = XFS_TRANS_HEADER_MAGIC; hdr->thdr.th_type = XFS_TRANS_CHECKPOINT; - hdr->thdr.th_tid = tic->t_tid; + hdr->thdr.th_tid = tid; hdr->thdr.th_num_items = num_iovecs; - hdr->lhdr.i_addr = &hdr->thdr; - hdr->lhdr.i_len = sizeof(xfs_trans_header_t); - hdr->lhdr.i_type = XLOG_REG_TYPE_TRANSHDR; - tic->t_curr_res -= hdr->lhdr.i_len + sizeof(xlog_op_header_t); - lvhdr->lv_niovecs = 1; - lvhdr->lv_iovecp = &hdr->lhdr; + /* log iovec region pointer */ + hdr->lhdr[1].i_addr = &hdr->oph[1]; + hdr->lhdr[1].i_len = sizeof(struct xlog_op_header) + + sizeof(struct xfs_trans_header); + hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR; + + tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len; + + lvhdr->lv_niovecs = 2; + lvhdr->lv_iovecp = &hdr->lhdr[0]; lvhdr->lv_next = ctx->lv_chain; } From patchwork Fri Mar 5 05:11:19 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117723 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,TVD_PH_BODY_ACCOUNTS_PRE, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 19345C4332B for ; Fri, 5 Mar 2021 05:12:29 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E9CDF65013 for ; Fri, 5 Mar 2021 05:12:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229601AbhCEFM2 (ORCPT ); Fri, 5 Mar 2021 00:12:28 -0500 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:55935 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229609AbhCEFMJ (ORCPT ); Fri, 5 Mar 2021 00:12:09 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 9F4B86309C for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fboj-Rm for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZl-Jz for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 21/45] xfs: embed the xlog_op_header in the unmount record Date: Fri, 5 Mar 2021 16:11:19 +1100 Message-Id: <20210305051143.182133-22-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=3V92cHclhPhCIjMfKr4A:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Remove another case where xlog_write() has to prepend an opheader to a log transaction. The unmount record + ophdr is smaller than the minimum amount of space guaranteed to be free in an iclog (2 * sizeof(ophdr)) and so we don't have to care about an unmount record being split across 2 iclogs. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_log.c | 35 ++++++++++++++++++++++++----------- 1 file changed, 24 insertions(+), 11 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index b2f9fb1b4fed..94711b9ff007 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -798,12 +798,22 @@ xlog_write_unmount_record( struct xlog *log, struct xlog_ticket *ticket) { - struct xfs_unmount_log_format ulf = { - .magic = XLOG_UNMOUNT_TYPE, + struct { + struct xlog_op_header ophdr; + struct xfs_unmount_log_format ulf; + } unmount_rec = { + .ophdr = { + .oh_clientid = XFS_LOG, + .oh_tid = cpu_to_be32(ticket->t_tid), + .oh_flags = XLOG_UNMOUNT_TRANS, + }, + .ulf = { + .magic = XLOG_UNMOUNT_TYPE, + }, }; struct xfs_log_iovec reg = { - .i_addr = &ulf, - .i_len = sizeof(ulf), + .i_addr = &unmount_rec, + .i_len = sizeof(unmount_rec), .i_type = XLOG_REG_TYPE_UNMOUNT, }; struct xfs_log_vec vec = { @@ -812,7 +822,7 @@ xlog_write_unmount_record( }; /* account for space used by record data */ - ticket->t_curr_res -= sizeof(ulf); + ticket->t_curr_res -= sizeof(unmount_rec); /* * For external log devices, we need to flush the data device cache @@ -2138,6 +2148,8 @@ xlog_write_calc_vec_length( /* Don't account for regions with embedded ophdrs */ if (optype && headers > 0) { + if (optype & XLOG_UNMOUNT_TRANS) + headers--; if (optype & XLOG_START_TRANS) { ASSERT(headers >= 2); headers -= 2; @@ -2352,12 +2364,11 @@ xlog_write( /* * If this is a commit or unmount transaction, we don't need a start - * record to be written. We do, however, have to account for the - * commit or unmount header that gets written. Hence we always have - * to account for an extra xlog_op_header here for commit and unmount - * records. + * record to be written. We do, however, have to account for the commit + * header that gets written. Hence we always have to account for an + * extra xlog_op_header here for commit records. */ - if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) + if (optype & XLOG_COMMIT_TRANS) ticket->t_curr_res -= sizeof(struct xlog_op_header); if (ticket->t_curr_res < 0) { xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, @@ -2428,6 +2439,8 @@ xlog_write( ophdr = reg->i_addr; if (index) optype &= ~XLOG_START_TRANS; + } else if (optype & XLOG_UNMOUNT_TRANS) { + ophdr = reg->i_addr; } else { ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype); @@ -2458,7 +2471,7 @@ xlog_write( /* * Copy region. * - * Commit and unmount records just log an opheader, so + * Commit records just log an opheader, so * we can have empty payloads with no data region to * copy. Hence we only copy the payload if the vector * says it has data to copy. From patchwork Fri Mar 5 05:11:20 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117743 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,TVD_PH_BODY_ACCOUNTS_PRE, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 35C84C4332E for ; Fri, 5 Mar 2021 05:12:55 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1524E65017 for ; Fri, 5 Mar 2021 05:12:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229608AbhCEFMy (ORCPT ); Fri, 5 Mar 2021 00:12:54 -0500 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:58883 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229669AbhCEFMP (ORCPT ); Fri, 5 Mar 2021 00:12:15 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 99E988280CB for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbom-Sx for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZo-L0 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 22/45] xfs: embed the xlog_op_header in the commit record Date: Fri, 5 Mar 2021 16:11:20 +1100 Message-Id: <20210305051143.182133-23-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=Tu+Yewfh c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=o5Hyw1bTQpth9r4-HlMA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org Remove the final case where xlog_write() has to prepend an opheader to a log transaction. Similar to the start record, the commit record is just an empty opheader with a XLOG_COMMIT_TRANS type, so we can just make this the payload for the region being passed to xlog_write() and remove the special handling in xlog_write() for the commit record. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_log.c | 33 +++++++++++++++------------------ 1 file changed, 15 insertions(+), 18 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 94711b9ff007..c2e69a1f5cad 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1529,9 +1529,14 @@ xlog_commit_record( struct xlog_in_core **iclog, xfs_lsn_t *lsn) { + struct xlog_op_header ophdr = { + .oh_clientid = XFS_TRANSACTION, + .oh_tid = cpu_to_be32(ticket->t_tid), + .oh_flags = XLOG_COMMIT_TRANS, + }; struct xfs_log_iovec reg = { - .i_addr = NULL, - .i_len = 0, + .i_addr = &ophdr, + .i_len = sizeof(struct xlog_op_header), .i_type = XLOG_REG_TYPE_COMMIT, }; struct xfs_log_vec vec = { @@ -1543,6 +1548,8 @@ xlog_commit_record( if (XLOG_FORCED_SHUTDOWN(log)) return -EIO; + /* account for space used by record data */ + ticket->t_curr_res -= reg.i_len; error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS); if (error) xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); @@ -2148,11 +2155,10 @@ xlog_write_calc_vec_length( /* Don't account for regions with embedded ophdrs */ if (optype && headers > 0) { - if (optype & XLOG_UNMOUNT_TRANS) - headers--; + headers--; if (optype & XLOG_START_TRANS) { - ASSERT(headers >= 2); - headers -= 2; + ASSERT(headers >= 1); + headers--; } } @@ -2362,14 +2368,6 @@ xlog_write( int data_cnt = 0; int error = 0; - /* - * If this is a commit or unmount transaction, we don't need a start - * record to be written. We do, however, have to account for the commit - * header that gets written. Hence we always have to account for an - * extra xlog_op_header here for commit records. - */ - if (optype & XLOG_COMMIT_TRANS) - ticket->t_curr_res -= sizeof(struct xlog_op_header); if (ticket->t_curr_res < 0) { xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, "ctx ticket reservation ran out. Need to up reservation"); @@ -2433,14 +2431,13 @@ xlog_write( /* * The XLOG_START_TRANS has embedded ophdrs for the * start record and transaction header. They will always - * be the first two regions in the lv chain. + * be the first two regions in the lv chain. Commit and + * unmount records also have embedded ophdrs. */ - if (optype & XLOG_START_TRANS) { + if (optype) { ophdr = reg->i_addr; if (index) optype &= ~XLOG_START_TRANS; - } else if (optype & XLOG_UNMOUNT_TRANS) { - ophdr = reg->i_addr; } else { ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype); From patchwork Fri Mar 5 05:11:21 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117715 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id EABB2C433E0 for ; Fri, 5 Mar 2021 05:12:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C94BB65013 for ; Fri, 5 Mar 2021 05:12:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229575AbhCEFMX (ORCPT ); Fri, 5 Mar 2021 00:12:23 -0500 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:36896 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229573AbhCEFL7 (ORCPT ); Fri, 5 Mar 2021 00:11:59 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 9A4F8ECB1E3 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbop-US for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZr-MU for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 23/45] xfs: log tickets don't need log client id Date: Fri, 5 Mar 2021 16:11:21 +1100 Message-Id: <20210305051143.182133-24-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=hMR6LrX7Ow0noj7qeasA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner We currently set the log ticket client ID when we reserve a transaction. This client ID is only ever written to the log by a CIL checkpoint or unmount records, and so anything using a high level transaction allocated through xfs_trans_alloc() does not need a log ticket client ID to be set. For the CIL checkpoint, the client ID written to the journal is always XFS_TRANSACTION, and for the unmount record it is always XFS_LOG, and nothing else writes to the log. All of these operations tell xlog_write() exactly what they need to write to the log (the optype) and build their own opheaders for start, commit and unmount records. Hence we no longer need to set the client id in either the log ticket or the xfs_trans. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_log.c | 47 ++++++++----------------------------------- fs/xfs/xfs_log.h | 16 ++++++--------- fs/xfs/xfs_log_cil.c | 2 +- fs/xfs/xfs_log_priv.h | 10 ++------- fs/xfs/xfs_trans.c | 6 ++---- 5 files changed, 19 insertions(+), 62 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index c2e69a1f5cad..429cb1e7cc67 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -431,10 +431,9 @@ xfs_log_regrant( int xfs_log_reserve( struct xfs_mount *mp, - int unit_bytes, - int cnt, + int unit_bytes, + int cnt, struct xlog_ticket **ticp, - uint8_t client, bool permanent) { struct xlog *log = mp->m_log; @@ -442,15 +441,13 @@ xfs_log_reserve( int need_bytes; int error = 0; - ASSERT(client == XFS_TRANSACTION || client == XFS_LOG); - if (XLOG_FORCED_SHUTDOWN(log)) return -EIO; XFS_STATS_INC(mp, xs_try_logspace); ASSERT(*ticp == NULL); - tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent); + tic = xlog_ticket_alloc(log, unit_bytes, cnt, permanent); *ticp = tic; xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt @@ -847,7 +844,7 @@ xlog_unmount_write( struct xlog_ticket *tic = NULL; int error; - error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0); + error = xfs_log_reserve(mp, 600, 1, &tic, 0); if (error) goto out_err; @@ -2170,35 +2167,13 @@ xlog_write_calc_vec_length( static xlog_op_header_t * xlog_write_setup_ophdr( - struct xlog *log, struct xlog_op_header *ophdr, - struct xlog_ticket *ticket, - uint flags) + struct xlog_ticket *ticket) { ophdr->oh_tid = cpu_to_be32(ticket->t_tid); - ophdr->oh_clientid = ticket->t_clientid; + ophdr->oh_clientid = XFS_TRANSACTION; ophdr->oh_res2 = 0; - - /* are we copying a commit or unmount record? */ - ophdr->oh_flags = flags; - - /* - * We've seen logs corrupted with bad transaction client ids. This - * makes sure that XFS doesn't generate them on. Turn this into an EIO - * and shut down the filesystem. - */ - switch (ophdr->oh_clientid) { - case XFS_TRANSACTION: - case XFS_VOLUME: - case XFS_LOG: - break; - default: - xfs_warn(log->l_mp, - "Bad XFS transaction clientid 0x%x in ticket "PTR_FMT, - ophdr->oh_clientid, ticket); - return NULL; - } - + ophdr->oh_flags = 0; return ophdr; } @@ -2439,11 +2414,7 @@ xlog_write( if (index) optype &= ~XLOG_START_TRANS; } else { - ophdr = xlog_write_setup_ophdr(log, ptr, - ticket, optype); - if (!ophdr) - return -EIO; - + ophdr = xlog_write_setup_ophdr(ptr, ticket); xlog_write_adv_cnt(&ptr, &len, &log_offset, sizeof(struct xlog_op_header)); added_ophdr = true; @@ -3499,7 +3470,6 @@ xlog_ticket_alloc( struct xlog *log, int unit_bytes, int cnt, - char client, bool permanent) { struct xlog_ticket *tic; @@ -3517,7 +3487,6 @@ xlog_ticket_alloc( tic->t_cnt = cnt; tic->t_ocnt = cnt; tic->t_tid = prandom_u32(); - tic->t_clientid = client; if (permanent) tic->t_flags |= XLOG_TIC_PERM_RESERV; diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 1bd080ce3a95..c0c3141944ea 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -117,16 +117,12 @@ int xfs_log_mount_finish(struct xfs_mount *mp); void xfs_log_mount_cancel(struct xfs_mount *); xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp); xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp); -void xfs_log_space_wake(struct xfs_mount *mp); -int xfs_log_reserve(struct xfs_mount *mp, - int length, - int count, - struct xlog_ticket **ticket, - uint8_t clientid, - bool permanent); -int xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic); -void xfs_log_unmount(struct xfs_mount *mp); -int xfs_log_force_umount(struct xfs_mount *mp, int logerror); +void xfs_log_space_wake(struct xfs_mount *mp); +int xfs_log_reserve(struct xfs_mount *mp, int length, int count, + struct xlog_ticket **ticket, bool permanent); +int xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic); +void xfs_log_unmount(struct xfs_mount *mp); +int xfs_log_force_umount(struct xfs_mount *mp, int logerror); bool xfs_log_writable(struct xfs_mount *mp); struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket); diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index e9da074ecd69..0c81c13e2cf6 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -37,7 +37,7 @@ xlog_cil_ticket_alloc( { struct xlog_ticket *tic; - tic = xlog_ticket_alloc(log, 0, 1, XFS_TRANSACTION, 0); + tic = xlog_ticket_alloc(log, 0, 1, 0); /* * set the current reservation to zero so we know to steal the basic diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index bb5fa6b71114..7f601c1c9f45 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -158,7 +158,6 @@ typedef struct xlog_ticket { int t_unit_res; /* unit reservation in bytes : 4 */ char t_ocnt; /* original count : 1 */ char t_cnt; /* current count : 1 */ - char t_clientid; /* who does this belong to; : 1 */ char t_flags; /* properties of reservation : 1 */ /* reservation array fields */ @@ -465,13 +464,8 @@ extern __le32 xlog_cksum(struct xlog *log, struct xlog_rec_header *rhead, char *dp, int size); extern kmem_zone_t *xfs_log_ticket_zone; -struct xlog_ticket * -xlog_ticket_alloc( - struct xlog *log, - int unit_bytes, - int count, - char client, - bool permanent); +struct xlog_ticket *xlog_ticket_alloc(struct xlog *log, int unit_bytes, + int count, bool permanent); static inline void xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes) diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 52f3fdf1e0de..83c2b7f22eb7 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -194,11 +194,9 @@ xfs_trans_reserve( ASSERT(resp->tr_logflags & XFS_TRANS_PERM_LOG_RES); error = xfs_log_regrant(mp, tp->t_ticket); } else { - error = xfs_log_reserve(mp, - resp->tr_logres, + error = xfs_log_reserve(mp, resp->tr_logres, resp->tr_logcount, - &tp->t_ticket, XFS_TRANSACTION, - permanent); + &tp->t_ticket, permanent); } if (error) From patchwork Fri Mar 5 05:11:22 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117691 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CE727C433E9 for ; Fri, 5 Mar 2021 05:12:16 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A492C65004 for ; Fri, 5 Mar 2021 05:12:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229504AbhCEFMP (ORCPT ); Fri, 5 Mar 2021 00:12:15 -0500 Received: from mail108.syd.optusnet.com.au ([211.29.132.59]:48083 "EHLO mail108.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229526AbhCEFL5 (ORCPT ); Fri, 5 Mar 2021 00:11:57 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail108.syd.optusnet.com.au (Postfix) with ESMTPS id 9A3881ADF43 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kg-00Fbos-V5 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZu-Nl for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 24/45] xfs: move log iovec alignment to preparation function Date: Fri, 5 Mar 2021 16:11:22 +1100 Message-Id: <20210305051143.182133-25-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=H4FU2hCJ8sVSBq7siF0A:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner To include log op headers directly into the log iovec regions that the ophdrs wrap, we need to move the buffer alignment code from xlog_finish_iovec() to xlog_prepare_iovec(). This is because the xlog_op_header is only 12 bytes long, and we need the buffer that the caller formats their data into to be 8 byte aligned. Hence once we start prepending the ophdr in xlog_prepare_iovec(), we are going to need to manage the padding directly to ensure that the buffer pointer returned is correctly aligned. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_log.h | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index c0c3141944ea..1ca4f2edbdaf 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -21,6 +21,16 @@ struct xfs_log_vec { #define XFS_LOG_VEC_ORDERED (-1) +/* + * We need to make sure the buffer pointer returned is naturally aligned for the + * biggest basic data type we put into it. We have already accounted for this + * padding when sizing the buffer. + * + * However, this padding does not get written into the log, and hence we have to + * track the space used by the log vectors separately to prevent log space hangs + * due to inaccurate accounting (i.e. a leak) of the used log space through the + * CIL context ticket. + */ static inline void * xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp, uint type) @@ -34,6 +44,9 @@ xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp, vec = &lv->lv_iovecp[0]; } + if (!IS_ALIGNED(lv->lv_buf_len, sizeof(uint64_t))) + lv->lv_buf_len = round_up(lv->lv_buf_len, sizeof(uint64_t)); + vec->i_type = type; vec->i_addr = lv->lv_buf + lv->lv_buf_len; @@ -43,20 +56,10 @@ xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp, return vec->i_addr; } -/* - * We need to make sure the next buffer is naturally aligned for the biggest - * basic data type we put into it. We already accounted for this padding when - * sizing the buffer. - * - * However, this padding does not get written into the log, and hence we have to - * track the space used by the log vectors separately to prevent log space hangs - * due to inaccurate accounting (i.e. a leak) of the used log space through the - * CIL context ticket. - */ static inline void xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len) { - lv->lv_buf_len += round_up(len, sizeof(uint64_t)); + lv->lv_buf_len += len; lv->lv_bytes += len; vec->i_len = len; } From patchwork Fri Mar 5 05:11:23 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117747 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8AF65C43331 for ; Fri, 5 Mar 2021 05:12:55 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 69DF165014 for ; Fri, 5 Mar 2021 05:12:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229446AbhCEFMy (ORCPT ); Fri, 5 Mar 2021 00:12:54 -0500 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:58881 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229679AbhCEFMP (ORCPT ); Fri, 5 Mar 2021 00:12:15 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 99C6F8278F6 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbov-0J for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000lZx-OZ for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting Date: Fri, 5 Mar 2021 16:11:23 +1100 Message-Id: <20210305051143.182133-26-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=hH36q76zEQHc0ou5sH0A:9 a=nsFqJDEMGLFk_TX9:21 a=9dpIQxxcfEYiHEe9:21 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Current xlog_write() adds op headers to the log manually for every log item region that is in the vector passed to it. While xlog_write() needs to stamp the transaction ID into the ophdr, we already know it's length, flags, clientid, etc at CIL commit time. This means the only time that xlog write really needs to format and reserve space for a new ophdr is when a region is split across two iclogs. Adding the opheader and accounting for it as part of the normal formatted item region means we simplify the accounting of space used by a transaction and we don't have to special case reserving of space in for the ophdrs in xlog_write(). It also means we can largely initialise the ophdr in transaction commit instead of xlog_write, making the xlog_write formatting inner loop much tighter. xlog_prepare_iovec() is now too large to stay as an inline function, so we move it out of line and into xfs_log.c. Object sizes: text data bss dec hex filename 1125934 305951 484 1432369 15db31 fs/xfs/built-in.a.before 1123360 305951 484 1429795 15d123 fs/xfs/built-in.a.after So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now out of line, even though it grew in size itself. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_log.c | 115 +++++++++++++++++++++++++++++-------------- fs/xfs/xfs_log.h | 42 +++------------- fs/xfs/xfs_log_cil.c | 25 +++++----- 3 files changed, 99 insertions(+), 83 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 429cb1e7cc67..98de45be80c0 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -89,6 +89,62 @@ xlog_iclogs_empty( static int xfs_log_cover(struct xfs_mount *); +/* + * We need to make sure the buffer pointer returned is naturally aligned for the + * biggest basic data type we put into it. We have already accounted for this + * padding when sizing the buffer. + * + * However, this padding does not get written into the log, and hence we have to + * track the space used by the log vectors separately to prevent log space hangs + * due to inaccurate accounting (i.e. a leak) of the used log space through the + * CIL context ticket. + * + * We also add space for the xlog_op_header that describes this region in the + * log. This prepends the data region we return to the caller to copy their data + * into, so do all the static initialisation of the ophdr now. Because the ophdr + * is not 8 byte aligned, we have to be careful to ensure that we align the + * start of the buffer such that the region we return to the call is 8 byte + * aligned and packed against the tail of the ophdr. + */ +void * +xlog_prepare_iovec( + struct xfs_log_vec *lv, + struct xfs_log_iovec **vecp, + uint type) +{ + struct xfs_log_iovec *vec = *vecp; + struct xlog_op_header *oph; + uint32_t len; + void *buf; + + if (vec) { + ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs); + vec++; + } else { + vec = &lv->lv_iovecp[0]; + } + + len = lv->lv_buf_len + sizeof(struct xlog_op_header); + if (!IS_ALIGNED(len, sizeof(uint64_t))) { + lv->lv_buf_len = round_up(len, sizeof(uint64_t)) - + sizeof(struct xlog_op_header); + } + + vec->i_type = type; + vec->i_addr = lv->lv_buf + lv->lv_buf_len; + + oph = vec->i_addr; + oph->oh_clientid = XFS_TRANSACTION; + oph->oh_res2 = 0; + oph->oh_flags = 0; + + buf = vec->i_addr + sizeof(struct xlog_op_header); + ASSERT(IS_ALIGNED((unsigned long)buf, sizeof(uint64_t))); + + *vecp = vec; + return buf; +} + static void xlog_grant_sub_space( struct xlog *log, @@ -2120,9 +2176,9 @@ xlog_print_trans( } /* - * Calculate the potential space needed by the log vector. If this is a start - * transaction, the caller has already accounted for both opheaders in the start - * transaction, so we don't need to account for them here. + * Calculate the potential space needed by the log vector. All regions contain + * their own opheaders and they are accounted for in region space so we don't + * need to add them to the vector length here. */ static int xlog_write_calc_vec_length( @@ -2149,18 +2205,7 @@ xlog_write_calc_vec_length( xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type); } } - - /* Don't account for regions with embedded ophdrs */ - if (optype && headers > 0) { - headers--; - if (optype & XLOG_START_TRANS) { - ASSERT(headers >= 1); - headers--; - } - } - ticket->t_res_num_ophdrs += headers; - len += headers * sizeof(struct xlog_op_header); return len; } @@ -2170,7 +2215,6 @@ xlog_write_setup_ophdr( struct xlog_op_header *ophdr, struct xlog_ticket *ticket) { - ophdr->oh_tid = cpu_to_be32(ticket->t_tid); ophdr->oh_clientid = XFS_TRANSACTION; ophdr->oh_res2 = 0; ophdr->oh_flags = 0; @@ -2404,21 +2448,25 @@ xlog_write( ASSERT((unsigned long)ptr % sizeof(int32_t) == 0); /* - * The XLOG_START_TRANS has embedded ophdrs for the - * start record and transaction header. They will always - * be the first two regions in the lv chain. Commit and - * unmount records also have embedded ophdrs. + * Regions always have their ophdr at the start of the + * region, except for: + * - a transaction start which has a start record ophdr + * before the first region ophdr; and + * - the previous region didn't fully fit into an iclog + * so needs a continuation ophdr to prepend the region + * in this new iclog. */ - if (optype) { - ophdr = reg->i_addr; - if (index) - optype &= ~XLOG_START_TRANS; - } else { + ophdr = reg->i_addr; + if (optype && index) { + optype &= ~XLOG_START_TRANS; + } else if (partial_copy) { ophdr = xlog_write_setup_ophdr(ptr, ticket); xlog_write_adv_cnt(&ptr, &len, &log_offset, sizeof(struct xlog_op_header)); added_ophdr = true; } + ophdr->oh_tid = cpu_to_be32(ticket->t_tid); + len += xlog_write_setup_copy(ticket, ophdr, iclog->ic_size-log_offset, reg->i_len, @@ -2436,20 +2484,11 @@ xlog_write( ophdr->oh_len = cpu_to_be32(copy_len - sizeof(struct xlog_op_header)); } - /* - * Copy region. - * - * Commit records just log an opheader, so - * we can have empty payloads with no data region to - * copy. Hence we only copy the payload if the vector - * says it has data to copy. - */ - ASSERT(copy_len >= 0); - if (copy_len > 0) { - memcpy(ptr, reg->i_addr + copy_off, copy_len); - xlog_write_adv_cnt(&ptr, &len, &log_offset, - copy_len); - } + + ASSERT(copy_len > 0); + memcpy(ptr, reg->i_addr + copy_off, copy_len); + xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len); + if (added_ophdr) copy_len += sizeof(struct xlog_op_header); record_cnt++; diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 1ca4f2edbdaf..af54ea3f8c90 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -21,44 +21,18 @@ struct xfs_log_vec { #define XFS_LOG_VEC_ORDERED (-1) -/* - * We need to make sure the buffer pointer returned is naturally aligned for the - * biggest basic data type we put into it. We have already accounted for this - * padding when sizing the buffer. - * - * However, this padding does not get written into the log, and hence we have to - * track the space used by the log vectors separately to prevent log space hangs - * due to inaccurate accounting (i.e. a leak) of the used log space through the - * CIL context ticket. - */ -static inline void * -xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp, - uint type) -{ - struct xfs_log_iovec *vec = *vecp; - - if (vec) { - ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs); - vec++; - } else { - vec = &lv->lv_iovecp[0]; - } - - if (!IS_ALIGNED(lv->lv_buf_len, sizeof(uint64_t))) - lv->lv_buf_len = round_up(lv->lv_buf_len, sizeof(uint64_t)); - - vec->i_type = type; - vec->i_addr = lv->lv_buf + lv->lv_buf_len; - - ASSERT(IS_ALIGNED((unsigned long)vec->i_addr, sizeof(uint64_t))); - - *vecp = vec; - return vec->i_addr; -} +void *xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp, + uint type); static inline void xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len) { + struct xlog_op_header *oph = vec->i_addr; + + /* opheader tracks payload length, logvec tracks region length */ + oph->oh_len = len; + + len += sizeof(struct xlog_op_header); lv->lv_buf_len += len; lv->lv_bytes += len; vec->i_len = len; diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 0c81c13e2cf6..7a5e6bdb7876 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -181,13 +181,20 @@ xlog_cil_alloc_shadow_bufs( } /* - * We 64-bit align the length of each iovec so that the start - * of the next one is naturally aligned. We'll need to - * account for that slack space here. Then round nbytes up - * to 64-bit alignment so that the initial buffer alignment is - * easy to calculate and verify. + * We 64-bit align the length of each iovec so that the start of + * the next one is naturally aligned. We'll need to account for + * that slack space here. + * + * We also add the xlog_op_header to each region when + * formatting, but that's not accounted to the size of the item + * at this point. Hence we'll need an addition number of bytes + * for each vector to hold an opheader. + * + * Then round nbytes up to 64-bit alignment so that the initial + * buffer alignment is easy to calculate and verify. */ - nbytes += niovecs * sizeof(uint64_t); + nbytes += niovecs * + (sizeof(uint64_t) + sizeof(struct xlog_op_header)); nbytes = round_up(nbytes, sizeof(uint64_t)); /* @@ -433,11 +440,6 @@ xlog_cil_insert_items( spin_lock(&cil->xc_cil_lock); - /* account for space used by new iovec headers */ - iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t); - len += iovhdr_res; - ctx->nvecs += diff_iovecs; - /* attach the transaction to the CIL if it has any busy extents */ if (!list_empty(&tp->t_busy)) list_splice_init(&tp->t_busy, &ctx->busy_extents); @@ -469,6 +471,7 @@ xlog_cil_insert_items( } tp->t_ticket->t_curr_res -= len; ctx->space_used += len; + ctx->nvecs += diff_iovecs; /* * If we've overrun the reservation, dump the tx details before we move From patchwork Fri Mar 5 05:11:24 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117757 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5BBE5C43603 for ; Fri, 5 Mar 2021 05:12:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 313A865015 for ; Fri, 5 Mar 2021 05:12:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229629AbhCEFM4 (ORCPT ); Fri, 5 Mar 2021 00:12:56 -0500 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:58879 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229674AbhCEFMP (ORCPT ); Fri, 5 Mar 2021 00:12:15 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 84E16827CFB for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbox-1C for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000la0-Pg for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 26/45] xfs: log ticket region debug is largely useless Date: Fri, 5 Mar 2021 16:11:24 +1100 Message-Id: <20210305051143.182133-27-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=5EIqd8LjG3Ih9H_cmdUA:9 a=_-q0h3QqZziHmXHg:21 a=aZ5eJxrW-M_mdpSM:21 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner xlog_tic_add_region() is used to trace the regions being added to a log ticket to provide information in the situation where a ticket reservation overrun occurs. The information gathered is stored int the ticket, and dumped if xlog_print_tic_res() is called. For a front end struct xfs_trans overrun, the ticket only contains reservation tracking information - the ticket is never handed to the log so has no regions attached to it. The overrun debug information in this case comes from xlog_print_trans(), which walks the items attached to the transaction and dumps their attached formatted log vectors directly. It also dumps the ticket state, but that only contains reservation accounting and nothing else. Hence xlog_print_tic_res() never dumps region or overrun information from this path. xlog_tic_add_region() is actually called from xlog_write(), which means it is being used to track the regions seen in a CIL checkpoint log vector chain. In looking at CIL behaviour recently, I've seen 32MB checkpoints regularly exceed 250,000 regions in the LV chain. The log ticket debug code can track *15* regions. IOWs, if there is a ticket overrun in the CIL code, the ticket region tracking code is going to be completely useless for determining what went wrong. The only thing it can tell us is how much of an overrun occurred, and we really don't need extra debug information in the log ticket to tell us that. Indeed, the main place we call xlog_tic_add_region() is also adding up the number of regions and the space used so that xlog_write() knows how much will be written to the log. This is exactly the same information that log ticket is storing once we take away the useless region tracking array. Hence xlog_tic_add_region() is not useful, but can be called 250,000 times a CIL push... Just strip all that debug "information" out of the of the log ticket and only have it report reservation space information when an overrun occurs. This also reduces the size of a log ticket down by about 150 bytes... Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_log.c | 107 +++--------------------------------------- fs/xfs/xfs_log_priv.h | 17 ------- 2 files changed, 6 insertions(+), 118 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 98de45be80c0..412b167d8d0e 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -377,30 +377,6 @@ xlog_grant_head_check( return error; } -static void -xlog_tic_reset_res(xlog_ticket_t *tic) -{ - tic->t_res_num = 0; - tic->t_res_arr_sum = 0; - tic->t_res_num_ophdrs = 0; -} - -static void -xlog_tic_add_region(xlog_ticket_t *tic, uint len, uint type) -{ - if (tic->t_res_num == XLOG_TIC_LEN_MAX) { - /* add to overflow and start again */ - tic->t_res_o_flow += tic->t_res_arr_sum; - tic->t_res_num = 0; - tic->t_res_arr_sum = 0; - } - - tic->t_res_arr[tic->t_res_num].r_len = len; - tic->t_res_arr[tic->t_res_num].r_type = type; - tic->t_res_arr_sum += len; - tic->t_res_num++; -} - bool xfs_log_writable( struct xfs_mount *mp) @@ -448,8 +424,6 @@ xfs_log_regrant( xlog_grant_push_ail(log, tic->t_unit_res); tic->t_curr_res = tic->t_unit_res; - xlog_tic_reset_res(tic); - if (tic->t_cnt > 0) return 0; @@ -2066,63 +2040,11 @@ xlog_print_tic_res( struct xfs_mount *mp, struct xlog_ticket *ticket) { - uint i; - uint ophdr_spc = ticket->t_res_num_ophdrs * (uint)sizeof(xlog_op_header_t); - - /* match with XLOG_REG_TYPE_* in xfs_log.h */ -#define REG_TYPE_STR(type, str) [XLOG_REG_TYPE_##type] = str - static char *res_type_str[] = { - REG_TYPE_STR(BFORMAT, "bformat"), - REG_TYPE_STR(BCHUNK, "bchunk"), - REG_TYPE_STR(EFI_FORMAT, "efi_format"), - REG_TYPE_STR(EFD_FORMAT, "efd_format"), - REG_TYPE_STR(IFORMAT, "iformat"), - REG_TYPE_STR(ICORE, "icore"), - REG_TYPE_STR(IEXT, "iext"), - REG_TYPE_STR(IBROOT, "ibroot"), - REG_TYPE_STR(ILOCAL, "ilocal"), - REG_TYPE_STR(IATTR_EXT, "iattr_ext"), - REG_TYPE_STR(IATTR_BROOT, "iattr_broot"), - REG_TYPE_STR(IATTR_LOCAL, "iattr_local"), - REG_TYPE_STR(QFORMAT, "qformat"), - REG_TYPE_STR(DQUOT, "dquot"), - REG_TYPE_STR(QUOTAOFF, "quotaoff"), - REG_TYPE_STR(LRHEADER, "LR header"), - REG_TYPE_STR(UNMOUNT, "unmount"), - REG_TYPE_STR(COMMIT, "commit"), - REG_TYPE_STR(TRANSHDR, "trans header"), - REG_TYPE_STR(ICREATE, "inode create"), - REG_TYPE_STR(RUI_FORMAT, "rui_format"), - REG_TYPE_STR(RUD_FORMAT, "rud_format"), - REG_TYPE_STR(CUI_FORMAT, "cui_format"), - REG_TYPE_STR(CUD_FORMAT, "cud_format"), - REG_TYPE_STR(BUI_FORMAT, "bui_format"), - REG_TYPE_STR(BUD_FORMAT, "bud_format"), - }; - BUILD_BUG_ON(ARRAY_SIZE(res_type_str) != XLOG_REG_TYPE_MAX + 1); -#undef REG_TYPE_STR - xfs_warn(mp, "ticket reservation summary:"); - xfs_warn(mp, " unit res = %d bytes", - ticket->t_unit_res); - xfs_warn(mp, " current res = %d bytes", - ticket->t_curr_res); - xfs_warn(mp, " total reg = %u bytes (o/flow = %u bytes)", - ticket->t_res_arr_sum, ticket->t_res_o_flow); - xfs_warn(mp, " ophdrs = %u (ophdr space = %u bytes)", - ticket->t_res_num_ophdrs, ophdr_spc); - xfs_warn(mp, " ophdr + reg = %u bytes", - ticket->t_res_arr_sum + ticket->t_res_o_flow + ophdr_spc); - xfs_warn(mp, " num regions = %u", - ticket->t_res_num); - - for (i = 0; i < ticket->t_res_num; i++) { - uint r_type = ticket->t_res_arr[i].r_type; - xfs_warn(mp, "region[%u]: %s - %u bytes", i, - ((r_type <= 0 || r_type > XLOG_REG_TYPE_MAX) ? - "bad-rtype" : res_type_str[r_type]), - ticket->t_res_arr[i].r_len); - } + xfs_warn(mp, " unit res = %d bytes", ticket->t_unit_res); + xfs_warn(mp, " current res = %d bytes", ticket->t_curr_res); + xfs_warn(mp, " original count = %d", ticket->t_ocnt); + xfs_warn(mp, " remaining count = %d", ticket->t_cnt); } /* @@ -2187,7 +2109,6 @@ xlog_write_calc_vec_length( uint optype) { struct xfs_log_vec *lv; - int headers = 0; int len = 0; int i; @@ -2196,17 +2117,9 @@ xlog_write_calc_vec_length( if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) continue; - headers += lv->lv_niovecs; - - for (i = 0; i < lv->lv_niovecs; i++) { - struct xfs_log_iovec *vecp = &lv->lv_iovecp[i]; - - len += vecp->i_len; - xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type); - } + for (i = 0; i < lv->lv_niovecs; i++) + len += lv->lv_iovecp[i].i_len; } - ticket->t_res_num_ophdrs += headers; - return len; } @@ -2265,7 +2178,6 @@ xlog_write_setup_copy( /* account for new log op header */ ticket->t_curr_res -= sizeof(struct xlog_op_header); - ticket->t_res_num_ophdrs++; return sizeof(struct xlog_op_header); } @@ -2973,9 +2885,6 @@ xlog_state_get_iclog_space( */ if (log_offset == 0) { ticket->t_curr_res -= log->l_iclog_hsize; - xlog_tic_add_region(ticket, - log->l_iclog_hsize, - XLOG_REG_TYPE_LRHEADER); head->h_cycle = cpu_to_be32(log->l_curr_cycle); head->h_lsn = cpu_to_be64( xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block)); @@ -3055,7 +2964,6 @@ xfs_log_ticket_regrant( xlog_grant_sub_space(log, &log->l_write_head.grant, ticket->t_curr_res); ticket->t_curr_res = ticket->t_unit_res; - xlog_tic_reset_res(ticket); trace_xfs_log_ticket_regrant_sub(log, ticket); @@ -3066,7 +2974,6 @@ xfs_log_ticket_regrant( trace_xfs_log_ticket_regrant_exit(log, ticket); ticket->t_curr_res = ticket->t_unit_res; - xlog_tic_reset_res(ticket); } xfs_log_ticket_put(ticket); @@ -3529,8 +3436,6 @@ xlog_ticket_alloc( if (permanent) tic->t_flags |= XLOG_TIC_PERM_RESERV; - xlog_tic_reset_res(tic); - return tic; } diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 7f601c1c9f45..8ee6a5f74396 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -139,16 +139,6 @@ enum xlog_iclog_state { /* Ticket reservation region accounting */ #define XLOG_TIC_LEN_MAX 15 -/* - * Reservation region - * As would be stored in xfs_log_iovec but without the i_addr which - * we don't care about. - */ -typedef struct xlog_res { - uint r_len; /* region length :4 */ - uint r_type; /* region's transaction type :4 */ -} xlog_res_t; - typedef struct xlog_ticket { struct list_head t_queue; /* reserve/write queue */ struct task_struct *t_task; /* task that owns this ticket */ @@ -159,13 +149,6 @@ typedef struct xlog_ticket { char t_ocnt; /* original count : 1 */ char t_cnt; /* current count : 1 */ char t_flags; /* properties of reservation : 1 */ - - /* reservation array fields */ - uint t_res_num; /* num in array : 4 */ - uint t_res_num_ophdrs; /* num op hdrs : 4 */ - uint t_res_arr_sum; /* array sum : 4 */ - uint t_res_o_flow; /* sum overflow : 4 */ - xlog_res_t t_res_arr[XLOG_TIC_LEN_MAX]; /* array of res : 8 * 15 */ } xlog_ticket_t; /* From patchwork Fri Mar 5 05:11:25 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117717 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,TVD_PH_BODY_ACCOUNTS_PRE, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6DD73C433E6 for ; Fri, 5 Mar 2021 05:12:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4912065013 for ; Fri, 5 Mar 2021 05:12:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229573AbhCEFMY (ORCPT ); Fri, 5 Mar 2021 00:12:24 -0500 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:36882 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229562AbhCEFL7 (ORCPT ); Fri, 5 Mar 2021 00:11:59 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 88C45ECB21D for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbp1-2d for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000la3-Qf for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 27/45] xfs: pass lv chain length into xlog_write() Date: Fri, 5 Mar 2021 16:11:25 +1100 Message-Id: <20210305051143.182133-28-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=hIND3nFijBJh5Kz5_mcA:9 a=iIF9SUj9uIzfPmGa:21 a=766Es58KMcJUOJTe:21 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner The caller of xlog_write() usually has a close accounting of the aggregated vector length contained in the log vector chain passed to xlog_write(). There is no need to iterate the chain to calculate he length of the data in xlog_write_calculate_len() if the caller is already iterating that chain to build it. Passing in the vector length avoids doing an extra chain iteration, which can be a significant amount of work given that large CIL commits can have hundreds of thousands of vectors attached to the chain. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 37 ++++++------------------------------- fs/xfs/xfs_log_cil.c | 18 +++++++++++++----- fs/xfs/xfs_log_priv.h | 2 +- 3 files changed, 20 insertions(+), 37 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 412b167d8d0e..22f97914ab99 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -858,7 +858,8 @@ xlog_write_unmount_record( */ if (log->l_targ != log->l_mp->m_ddev_targp) blkdev_issue_flush(log->l_targ->bt_bdev); - return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS); + return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS, + reg.i_len); } /* @@ -1577,7 +1578,8 @@ xlog_commit_record( /* account for space used by record data */ ticket->t_curr_res -= reg.i_len; - error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS); + error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS, + reg.i_len); if (error) xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); return error; @@ -2097,32 +2099,6 @@ xlog_print_trans( } } -/* - * Calculate the potential space needed by the log vector. All regions contain - * their own opheaders and they are accounted for in region space so we don't - * need to add them to the vector length here. - */ -static int -xlog_write_calc_vec_length( - struct xlog_ticket *ticket, - struct xfs_log_vec *log_vector, - uint optype) -{ - struct xfs_log_vec *lv; - int len = 0; - int i; - - for (lv = log_vector; lv; lv = lv->lv_next) { - /* we don't write ordered log vectors */ - if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) - continue; - - for (i = 0; i < lv->lv_niovecs; i++) - len += lv->lv_iovecp[i].i_len; - } - return len; -} - static xlog_op_header_t * xlog_write_setup_ophdr( struct xlog_op_header *ophdr, @@ -2285,13 +2261,13 @@ xlog_write( struct xlog_ticket *ticket, xfs_lsn_t *start_lsn, struct xlog_in_core **commit_iclog, - uint optype) + uint optype, + uint32_t len) { struct xlog_in_core *iclog = NULL; struct xfs_log_vec *lv = log_vector; struct xfs_log_iovec *vecp = lv->lv_iovecp; int index = 0; - int len; int partial_copy = 0; int partial_copy_len = 0; int contwr = 0; @@ -2306,7 +2282,6 @@ xlog_write( xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); } - len = xlog_write_calc_vec_length(ticket, log_vector, optype); if (start_lsn) *start_lsn = 0; while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) { diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 7a5e6bdb7876..34abc3bae587 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -710,11 +710,12 @@ xlog_cil_build_trans_hdr( sizeof(struct xfs_trans_header); hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR; - tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len; - lvhdr->lv_niovecs = 2; lvhdr->lv_iovecp = &hdr->lhdr[0]; + lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len; lvhdr->lv_next = ctx->lv_chain; + + tic->t_curr_res -= lvhdr->lv_bytes; } /* @@ -742,7 +743,8 @@ xlog_cil_push_work( struct xfs_log_vec *lv; struct xfs_cil_ctx *new_ctx; struct xlog_in_core *commit_iclog; - int num_iovecs; + int num_iovecs = 0; + int num_bytes = 0; int error = 0; struct xlog_cil_trans_hdr thdr; struct xfs_log_vec lvhdr = { NULL }; @@ -841,7 +843,6 @@ xlog_cil_push_work( * by the flush lock. */ lv = NULL; - num_iovecs = 0; while (!list_empty(&cil->xc_cil)) { struct xfs_log_item *item; @@ -855,6 +856,10 @@ xlog_cil_push_work( lv = item->li_lv; item->li_lv = NULL; num_iovecs += lv->lv_niovecs; + + /* we don't write ordered log vectors */ + if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) + num_bytes += lv->lv_bytes; } /* @@ -893,6 +898,9 @@ xlog_cil_push_work( * transaction header here as it is not accounted for in xlog_write(). */ xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs); + num_iovecs += lvhdr.lv_niovecs; + num_bytes += lvhdr.lv_bytes; + /* * Before we format and submit the first iclog, we have to ensure that @@ -907,7 +915,7 @@ xlog_cil_push_work( * write head. */ error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL, - XLOG_START_TRANS); + XLOG_START_TRANS, num_bytes); if (error) goto out_abort_free_ticket; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 8ee6a5f74396..003c11653955 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -462,7 +462,7 @@ void xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket); void xlog_print_trans(struct xfs_trans *); int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector, struct xlog_ticket *tic, xfs_lsn_t *start_lsn, - struct xlog_in_core **commit_iclog, uint optype); + struct xlog_in_core **commit_iclog, uint optype, uint32_t len); int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, struct xlog_in_core **iclog, xfs_lsn_t *lsn); void xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog, From patchwork Fri Mar 5 05:11:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117719 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 69D19C43381 for ; Fri, 5 Mar 2021 05:12:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4B5F165013 for ; Fri, 5 Mar 2021 05:12:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229595AbhCEFM1 (ORCPT ); Fri, 5 Mar 2021 00:12:27 -0500 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:55937 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229597AbhCEFMC (ORCPT ); Fri, 5 Mar 2021 00:12:02 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 9F46E6294D for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbp4-3e for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000la6-SK for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 28/45] xfs: introduce xlog_write_single() Date: Fri, 5 Mar 2021 16:11:26 +1100 Message-Id: <20210305051143.182133-29-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=S83ClQuvNvUpNviAnokA:9 a=0bXxn9q0MV6snEgNplNhOjQmxlI=:19 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Introduce an optimised version of xlog_write() that is used when the entire write will fit in a single iclog. This greatly simplifies the implementation of writing a log vector chain into an iclog, and sets the ground work for a much more understandable xlog_write() implementation. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 56 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 22f97914ab99..590c1e6db475 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -2214,6 +2214,52 @@ xlog_write_copy_finish( return error; } +/* + * Write log vectors into a single iclog which is guaranteed by the caller + * to have enough space to write the entire log vector into. Return the number + * of log vectors written into the iclog. + */ +static int +xlog_write_single( + struct xfs_log_vec *log_vector, + struct xlog_ticket *ticket, + struct xlog_in_core *iclog, + uint32_t log_offset, + uint32_t len) +{ + struct xfs_log_vec *lv = log_vector; + void *ptr; + int index = 0; + int record_cnt = 0; + + ASSERT(log_offset + len <= iclog->ic_size); + + ptr = iclog->ic_datap + log_offset; + for (lv = log_vector; lv; lv = lv->lv_next) { + /* + * Ordered log vectors have no regions to write so this + * loop will naturally skip them. + */ + for (index = 0; index < lv->lv_niovecs; index++) { + struct xfs_log_iovec *reg = &lv->lv_iovecp[index]; + struct xlog_op_header *ophdr = reg->i_addr; + + ASSERT(reg->i_len % sizeof(int32_t) == 0); + ASSERT((unsigned long)ptr % sizeof(int32_t) == 0); + + ophdr->oh_tid = cpu_to_be32(ticket->t_tid); + ophdr->oh_len = cpu_to_be32(reg->i_len - + sizeof(struct xlog_op_header)); + memcpy(ptr, reg->i_addr, reg->i_len); + xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len); + record_cnt++; + } + } + ASSERT(len == 0); + return record_cnt; +} + + /* * Write some region out to in-core log * @@ -2294,7 +2340,6 @@ xlog_write( return error; ASSERT(log_offset <= iclog->ic_size - 1); - ptr = iclog->ic_datap + log_offset; /* Start_lsn is the first lsn written to. */ if (start_lsn && !*start_lsn) @@ -2311,10 +2356,20 @@ xlog_write( XLOG_ICL_NEED_FUA); } + /* If this is a single iclog write, go fast... */ + if (!contwr && lv == log_vector) { + record_cnt = xlog_write_single(lv, ticket, iclog, + log_offset, len); + len = 0; + data_cnt = len; + break; + } + /* * This loop writes out as many regions as can fit in the amount * of space which was allocated by xlog_state_get_iclog_space(). */ + ptr = iclog->ic_datap + log_offset; while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) { struct xfs_log_iovec *reg; struct xlog_op_header *ophdr; From patchwork Fri Mar 5 05:11:27 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117695 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8AD17C433E0 for ; Fri, 5 Mar 2021 05:12:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6294265013 for ; Fri, 5 Mar 2021 05:12:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229526AbhCEFMQ (ORCPT ); Fri, 5 Mar 2021 00:12:16 -0500 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:40568 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229559AbhCEFL6 (ORCPT ); Fri, 5 Mar 2021 00:11:58 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 9B01C781970 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbp7-5X for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000la9-TO for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 29/45] xfs:_introduce xlog_write_partial() Date: Fri, 5 Mar 2021 16:11:27 +1100 Message-Id: <20210305051143.182133-30-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=UN3bPJtuUfFEULSoGDgA:9 a=zzFVZPTRXN-JOuVz:21 a=x_Wm7I96Xx3HMxqL:21 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Handle writing of a logvec chain into an iclog that doesn't have enough space to fit it all. The iclog has already been changed to WANT_SYNC by xlog_get_iclog_space(), so the entire remaining space in the iclog is exclusively owned by this logvec chain. The difference between the single and partial cases is that we end up with partial iovec writes in the iclog and have to split a log vec regions across two iclogs. The state handling for this is currently awful and so we're building up the pieces needed to handle this more cleanly one at a time. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 525 ++++++++++++++++++++++------------------------- 1 file changed, 251 insertions(+), 274 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 590c1e6db475..10916b99bf0f 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -2099,166 +2099,250 @@ xlog_print_trans( } } -static xlog_op_header_t * -xlog_write_setup_ophdr( - struct xlog_op_header *ophdr, - struct xlog_ticket *ticket) -{ - ophdr->oh_clientid = XFS_TRANSACTION; - ophdr->oh_res2 = 0; - ophdr->oh_flags = 0; - return ophdr; -} - /* - * Set up the parameters of the region copy into the log. This has - * to handle region write split across multiple log buffers - this - * state is kept external to this function so that this code can - * be written in an obvious, self documenting manner. + * Write whole log vectors into a single iclog which is guaranteed to have + * either sufficient space for the entire log vector chain to be written or + * exclusive access to the remaining space in the iclog. + * + * Return the number of iovecs and data written into the iclog, as well as + * a pointer to the logvec that doesn't fit in the log (or NULL if we hit the + * end of the chain. */ -static int -xlog_write_setup_copy( +static struct xfs_log_vec * +xlog_write_single( + struct xfs_log_vec *log_vector, struct xlog_ticket *ticket, - struct xlog_op_header *ophdr, - int space_available, - int space_required, - int *copy_off, - int *copy_len, - int *last_was_partial_copy, - int *bytes_consumed) -{ - int still_to_copy; - - still_to_copy = space_required - *bytes_consumed; - *copy_off = *bytes_consumed; - - if (still_to_copy <= space_available) { - /* write of region completes here */ - *copy_len = still_to_copy; - ophdr->oh_len = cpu_to_be32(*copy_len); - if (*last_was_partial_copy) - ophdr->oh_flags |= (XLOG_END_TRANS|XLOG_WAS_CONT_TRANS); - *last_was_partial_copy = 0; - *bytes_consumed = 0; - return 0; - } - - /* partial write of region, needs extra log op header reservation */ - *copy_len = space_available; - ophdr->oh_len = cpu_to_be32(*copy_len); - ophdr->oh_flags |= XLOG_CONTINUE_TRANS; - if (*last_was_partial_copy) - ophdr->oh_flags |= XLOG_WAS_CONT_TRANS; - *bytes_consumed += *copy_len; - (*last_was_partial_copy)++; - - /* account for new log op header */ - ticket->t_curr_res -= sizeof(struct xlog_op_header); - - return sizeof(struct xlog_op_header); -} - -static int -xlog_write_copy_finish( - struct xlog *log, struct xlog_in_core *iclog, - uint flags, - int *record_cnt, - int *data_cnt, - int *partial_copy, - int *partial_copy_len, - int log_offset, - struct xlog_in_core **commit_iclog) + uint32_t *log_offset, + uint32_t *len, + uint32_t *record_cnt, + uint32_t *data_cnt) { - int error; + struct xfs_log_vec *lv = log_vector; + void *ptr; + int index; - if (*partial_copy) { + ASSERT(*log_offset + *len <= iclog->ic_size || + iclog->ic_state == XLOG_STATE_WANT_SYNC); + + ptr = iclog->ic_datap + *log_offset; + for (lv = log_vector; lv; lv = lv->lv_next) { /* - * This iclog has already been marked WANT_SYNC by - * xlog_state_get_iclog_space. + * If the entire log vec does not fit in the iclog, punt it to + * the partial copy loop which can handle this case. */ - spin_lock(&log->l_icloglock); - xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt); - *record_cnt = 0; - *data_cnt = 0; - goto release_iclog; - } + if (lv->lv_niovecs && + lv->lv_bytes > iclog->ic_size - *log_offset) + break; - *partial_copy = 0; - *partial_copy_len = 0; + /* + * Ordered log vectors have no regions to write so this + * loop will naturally skip them. + */ + for (index = 0; index < lv->lv_niovecs; index++) { + struct xfs_log_iovec *reg = &lv->lv_iovecp[index]; + struct xlog_op_header *ophdr = reg->i_addr; - if (iclog->ic_size - log_offset <= sizeof(xlog_op_header_t)) { - /* no more space in this iclog - push it. */ - spin_lock(&log->l_icloglock); - xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt); - *record_cnt = 0; - *data_cnt = 0; + ASSERT(reg->i_len % sizeof(int32_t) == 0); + ASSERT((unsigned long)ptr % sizeof(int32_t) == 0); - if (iclog->ic_state == XLOG_STATE_ACTIVE) - xlog_state_switch_iclogs(log, iclog, 0); - else - ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC || - iclog->ic_state == XLOG_STATE_IOERROR); - if (!commit_iclog) - goto release_iclog; - spin_unlock(&log->l_icloglock); - ASSERT(flags & XLOG_COMMIT_TRANS); - *commit_iclog = iclog; + ophdr->oh_tid = cpu_to_be32(ticket->t_tid); + ophdr->oh_len = cpu_to_be32(reg->i_len - + sizeof(struct xlog_op_header)); + memcpy(ptr, reg->i_addr, reg->i_len); + xlog_write_adv_cnt(&ptr, len, log_offset, reg->i_len); + (*record_cnt)++; + *data_cnt += reg->i_len; + } } + ASSERT(*len == 0 || lv); + return lv; +} - return 0; +static int +xlog_write_get_more_iclog_space( + struct xlog *log, + struct xlog_ticket *ticket, + struct xlog_in_core **iclogp, + uint32_t *log_offset, + uint32_t len, + uint32_t *record_cnt, + uint32_t *data_cnt, + int *contwr) +{ + struct xlog_in_core *iclog = *iclogp; + int error; -release_iclog: + spin_lock(&log->l_icloglock); + xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt); + ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC || + iclog->ic_state == XLOG_STATE_IOERROR); error = xlog_state_release_iclog(log, iclog); spin_unlock(&log->l_icloglock); - return error; + if (error) + return error; + + error = xlog_state_get_iclog_space(log, len, &iclog, + ticket, contwr, log_offset); + if (error) + return error; + *record_cnt = 0; + *data_cnt = 0; + *iclogp = iclog; + return 0; } /* - * Write log vectors into a single iclog which is guaranteed by the caller - * to have enough space to write the entire log vector into. Return the number - * of log vectors written into the iclog. + * Write log vectors into a single iclog which is smaller than the current chain + * length. We write until we cannot fit a full record into the remaining space + * and then stop. We return the log vector that is to be written that cannot + * wholly fit in the iclog. */ -static int -xlog_write_single( +static struct xfs_log_vec * +xlog_write_partial( + struct xlog *log, struct xfs_log_vec *log_vector, struct xlog_ticket *ticket, - struct xlog_in_core *iclog, - uint32_t log_offset, - uint32_t len) + struct xlog_in_core **iclogp, + uint32_t *log_offset, + uint32_t *len, + uint32_t *record_cnt, + uint32_t *data_cnt, + int *contwr) { + struct xlog_in_core *iclog = *iclogp; struct xfs_log_vec *lv = log_vector; + struct xfs_log_iovec *reg; + struct xlog_op_header *ophdr; void *ptr; int index = 0; - int record_cnt = 0; + uint32_t rlen; + int error; - ASSERT(log_offset + len <= iclog->ic_size); + /* walk the logvec, copying until we run out of space in the iclog */ + ptr = iclog->ic_datap + *log_offset; + for (index = 0; index < lv->lv_niovecs; index++) { + uint32_t reg_offset = 0; + + reg = &lv->lv_iovecp[index]; + ASSERT(reg->i_len % sizeof(int32_t) == 0); - ptr = iclog->ic_datap + log_offset; - for (lv = log_vector; lv; lv = lv->lv_next) { /* - * Ordered log vectors have no regions to write so this - * loop will naturally skip them. + * The first region of a continuation must have a non-zero + * length otherwise log recovery will just skip over it and + * start recovering from the next opheader it finds. Because we + * mark the next opheader as a continuation, recovery will then + * incorrectly add the continuation to the previous region and + * that breaks stuff. + * + * Hence if there isn't space for region data after the + * opheader, then we need to start afresh with a new iclog. */ - for (index = 0; index < lv->lv_niovecs; index++) { - struct xfs_log_iovec *reg = &lv->lv_iovecp[index]; - struct xlog_op_header *ophdr = reg->i_addr; + if (iclog->ic_size - *log_offset <= + sizeof(struct xlog_op_header)) { + error = xlog_write_get_more_iclog_space(log, ticket, + &iclog, log_offset, *len, record_cnt, + data_cnt, contwr); + if (error) + return ERR_PTR(error); + ptr = iclog->ic_datap + *log_offset; + } - ASSERT(reg->i_len % sizeof(int32_t) == 0); - ASSERT((unsigned long)ptr % sizeof(int32_t) == 0); + ophdr = reg->i_addr; + rlen = min_t(uint32_t, reg->i_len, iclog->ic_size - *log_offset); + + ophdr->oh_tid = cpu_to_be32(ticket->t_tid); + ophdr->oh_len = cpu_to_be32(rlen - sizeof(struct xlog_op_header)); + if (rlen != reg->i_len) + ophdr->oh_flags |= XLOG_CONTINUE_TRANS; + ASSERT((unsigned long)ptr % sizeof(int32_t) == 0); + xlog_verify_dest_ptr(log, ptr); + memcpy(ptr, reg->i_addr, rlen); + xlog_write_adv_cnt(&ptr, len, log_offset, rlen); + (*record_cnt)++; + *data_cnt += rlen; + + if (rlen == reg->i_len) + continue; + + /* + * We now have a partially written iovec, but it can span + * multiple iclogs so we loop here. First we release the iclog + * we currently have, then we get a new iclog and add a new + * opheader. Then we continue copying from where we were until + * we either complete the iovec or fill the iclog. If we + * complete the iovec, then we increment the index and go right + * back to the top of the outer loop. if we fill the iclog, we + * run the inner loop again. + * + * This is complicated by the tail of a region using all the + * space in an iclog and hence requiring us to release the iclog + * and get a new one before returning to the outer loop. We must + * always guarantee that we exit this inner loop with at least + * space for log transaction opheaders left in the current + * iclog, hence we cannot just terminate the loop at the end + * of the of the continuation. So we loop while there is no + * space left in the current iclog, and check for the end of the + * continuation after getting a new iclog. + */ + do { + /* + * Account for the continuation opheader before we get + * a new iclog. This is necessary so that we reserve + * space in the iclog for it. + */ + if (ophdr->oh_flags & XLOG_CONTINUE_TRANS) { + *len += sizeof(struct xlog_op_header); + ticket->t_curr_res -= sizeof(struct xlog_op_header); + } + error = xlog_write_get_more_iclog_space(log, ticket, + &iclog, log_offset, *len, record_cnt, + data_cnt, contwr); + if (error) + return ERR_PTR(error); + ptr = iclog->ic_datap + *log_offset; + + ophdr = ptr; ophdr->oh_tid = cpu_to_be32(ticket->t_tid); - ophdr->oh_len = cpu_to_be32(reg->i_len - + ophdr->oh_clientid = XFS_TRANSACTION; + ophdr->oh_res2 = 0; + ophdr->oh_flags = XLOG_WAS_CONT_TRANS; + + xlog_write_adv_cnt(&ptr, len, log_offset, sizeof(struct xlog_op_header)); - memcpy(ptr, reg->i_addr, reg->i_len); - xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len); - record_cnt++; - } + *data_cnt += sizeof(struct xlog_op_header); + + /* + * If rlen fits in the iclog, then end the region + * continuation. Otherwise we're going around again. + */ + reg_offset += rlen; + rlen = reg->i_len - reg_offset; + if (rlen <= iclog->ic_size - *log_offset) + ophdr->oh_flags |= XLOG_END_TRANS; + else + ophdr->oh_flags |= XLOG_CONTINUE_TRANS; + + rlen = min_t(uint32_t, rlen, iclog->ic_size - *log_offset); + ophdr->oh_len = cpu_to_be32(rlen); + + xlog_verify_dest_ptr(log, ptr); + memcpy(ptr, reg->i_addr + reg_offset, rlen); + xlog_write_adv_cnt(&ptr, len, log_offset, rlen); + (*record_cnt)++; + *data_cnt += rlen; + + } while (ophdr->oh_flags & XLOG_CONTINUE_TRANS); } - ASSERT(len == 0); - return record_cnt; -} + /* + * No more iovecs remain in this logvec so return the next log vec to + * the caller so it can go back to fast path copying. + */ + *iclogp = iclog; + return lv->lv_next; +} /* * Write some region out to in-core log @@ -2312,14 +2396,11 @@ xlog_write( { struct xlog_in_core *iclog = NULL; struct xfs_log_vec *lv = log_vector; - struct xfs_log_iovec *vecp = lv->lv_iovecp; - int index = 0; - int partial_copy = 0; - int partial_copy_len = 0; int contwr = 0; int record_cnt = 0; int data_cnt = 0; int error = 0; + int log_offset; if (ticket->t_curr_res < 0) { xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, @@ -2328,157 +2409,52 @@ xlog_write( xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); } - if (start_lsn) - *start_lsn = 0; - while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) { - void *ptr; - int log_offset; - - error = xlog_state_get_iclog_space(log, len, &iclog, ticket, - &contwr, &log_offset); - if (error) - return error; - - ASSERT(log_offset <= iclog->ic_size - 1); + error = xlog_state_get_iclog_space(log, len, &iclog, ticket, + &contwr, &log_offset); + if (error) + return error; - /* Start_lsn is the first lsn written to. */ - if (start_lsn && !*start_lsn) - *start_lsn = be64_to_cpu(iclog->ic_header.h_lsn); + /* start_lsn is the LSN of the first iclog written to. */ + if (start_lsn) + *start_lsn = be64_to_cpu(iclog->ic_header.h_lsn); - /* - * iclogs containing commit records or unmount records need - * to issue ordering cache flushes and commit immediately - * to stable storage to guarantee journal vs metadata ordering - * is correctly maintained in the storage media. - */ - if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) { - iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | - XLOG_ICL_NEED_FUA); - } + /* + * iclogs containing commit records or unmount records need + * to issue ordering cache flushes and commit immediately + * to stable storage to guarantee journal vs metadata ordering + * is correctly maintained in the storage media. This will always + * fit in the iclog we have been already been passed. + */ + if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) { + iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA); + ASSERT(!contwr); + } - /* If this is a single iclog write, go fast... */ - if (!contwr && lv == log_vector) { - record_cnt = xlog_write_single(lv, ticket, iclog, - log_offset, len); - len = 0; - data_cnt = len; + while (lv) { + lv = xlog_write_single(lv, ticket, iclog, &log_offset, + &len, &record_cnt, &data_cnt); + if (!lv) break; - } - - /* - * This loop writes out as many regions as can fit in the amount - * of space which was allocated by xlog_state_get_iclog_space(). - */ - ptr = iclog->ic_datap + log_offset; - while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) { - struct xfs_log_iovec *reg; - struct xlog_op_header *ophdr; - int copy_len; - int copy_off; - bool ordered = false; - bool added_ophdr = false; - - /* ordered log vectors have no regions to write */ - if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) { - ASSERT(lv->lv_niovecs == 0); - ordered = true; - goto next_lv; - } - - reg = &vecp[index]; - ASSERT(reg->i_len % sizeof(int32_t) == 0); - ASSERT((unsigned long)ptr % sizeof(int32_t) == 0); - - /* - * Regions always have their ophdr at the start of the - * region, except for: - * - a transaction start which has a start record ophdr - * before the first region ophdr; and - * - the previous region didn't fully fit into an iclog - * so needs a continuation ophdr to prepend the region - * in this new iclog. - */ - ophdr = reg->i_addr; - if (optype && index) { - optype &= ~XLOG_START_TRANS; - } else if (partial_copy) { - ophdr = xlog_write_setup_ophdr(ptr, ticket); - xlog_write_adv_cnt(&ptr, &len, &log_offset, - sizeof(struct xlog_op_header)); - added_ophdr = true; - } - ophdr->oh_tid = cpu_to_be32(ticket->t_tid); - - len += xlog_write_setup_copy(ticket, ophdr, - iclog->ic_size-log_offset, - reg->i_len, - ©_off, ©_len, - &partial_copy, - &partial_copy_len); - xlog_verify_dest_ptr(log, ptr); - - /* - * Wart: need to update length in embedded ophdr not - * to include it's own length. - */ - if (!added_ophdr) { - ophdr->oh_len = cpu_to_be32(copy_len - - sizeof(struct xlog_op_header)); - } - - ASSERT(copy_len > 0); - memcpy(ptr, reg->i_addr + copy_off, copy_len); - xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len); - - if (added_ophdr) - copy_len += sizeof(struct xlog_op_header); - record_cnt++; - data_cnt += contwr ? copy_len : 0; - - error = xlog_write_copy_finish(log, iclog, optype, - &record_cnt, &data_cnt, - &partial_copy, - &partial_copy_len, - log_offset, - commit_iclog); - if (error) - return error; - - /* - * if we had a partial copy, we need to get more iclog - * space but we don't want to increment the region - * index because there is still more is this region to - * write. - * - * If we completed writing this region, and we flushed - * the iclog (indicated by resetting of the record - * count), then we also need to get more log space. If - * this was the last record, though, we are done and - * can just return. - */ - if (partial_copy) - break; - - if (++index == lv->lv_niovecs) { -next_lv: - lv = lv->lv_next; - index = 0; - if (lv) - vecp = lv->lv_iovecp; - } - if (record_cnt == 0 && !ordered) { - if (!lv) - return 0; - break; - } + ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))); + lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset, + &len, &record_cnt, &data_cnt, &contwr); + if (IS_ERR_OR_NULL(lv)) { + error = PTR_ERR_OR_ZERO(lv); + break; } } + ASSERT((len == 0 && !lv) || error); - ASSERT(len == 0); - + /* + * We've already been guaranteed that the last writes will fit inside + * the current iclog, and hence it will already have the space used by + * those writes accounted to it. Hence we do not need to update the + * iclog with the number of bytes written here. + */ + ASSERT(!contwr || XLOG_FORCED_SHUTDOWN(log)); spin_lock(&log->l_icloglock); - xlog_state_finish_copy(log, iclog, record_cnt, data_cnt); + xlog_state_finish_copy(log, iclog, record_cnt, 0); if (commit_iclog) { ASSERT(optype & XLOG_COMMIT_TRANS); *commit_iclog = iclog; @@ -2930,7 +2906,7 @@ xlog_state_get_iclog_space( * xlog_write() algorithm assumes that at least 2 xlog_op_header_t's * can fit into remaining data section. */ - if (iclog->ic_size - iclog->ic_offset < 2*sizeof(xlog_op_header_t)) { + if (iclog->ic_size - iclog->ic_offset < 3*sizeof(xlog_op_header_t)) { int error = 0; xlog_state_switch_iclogs(log, iclog, iclog->ic_size); @@ -3633,11 +3609,12 @@ xlog_verify_iclog( iclog->ic_header.h_cycle_data[idx]); } } - if (clientid != XFS_TRANSACTION && clientid != XFS_LOG) + if (clientid != XFS_TRANSACTION && clientid != XFS_LOG) { xfs_warn(log->l_mp, - "%s: invalid clientid %d op "PTR_FMT" offset 0x%lx", - __func__, clientid, ophead, + "%s: op %d invalid clientid %d op "PTR_FMT" offset 0x%lx", + __func__, i, clientid, ophead, (unsigned long)field_offset); + } /* check length */ p = &ophead->oh_len; From patchwork Fri Mar 5 05:11:28 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117749 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 083B5C43332 for ; Fri, 5 Mar 2021 05:12:56 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D7DDC65014 for ; Fri, 5 Mar 2021 05:12:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229523AbhCEFMz (ORCPT ); Fri, 5 Mar 2021 00:12:55 -0500 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:41258 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229646AbhCEFMP (ORCPT ); Fri, 5 Mar 2021 00:12:15 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id B019F78A78B for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00FbpA-6z for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kg-000laC-Uw for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:50 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 30/45] xfs: xlog_write() no longer needs contwr state Date: Fri, 5 Mar 2021 16:11:28 +1100 Message-Id: <20210305051143.182133-31-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=0CMaWtTLTcrcxSC13TsA:9 a=+jEqtf1s3R9VXZ0wqowq2kgwd+I=:19 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner The rework of xlog_write() no longer requires xlog_get_iclog_state() to tell it about internal iclog space reservation state to direct it on what to do. Remove this parameter. $ size fs/xfs/xfs_log.o.* text data bss dec hex filename 26520 560 8 27088 69d0 fs/xfs/xfs_log.o.orig 26384 560 8 26952 6948 fs/xfs/xfs_log.o.patched Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_log.c | 33 +++++++++++---------------------- 1 file changed, 11 insertions(+), 22 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 10916b99bf0f..8f4f7ae84358 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -47,7 +47,6 @@ xlog_state_get_iclog_space( int len, struct xlog_in_core **iclog, struct xlog_ticket *ticket, - int *continued_write, int *logoffsetp); STATIC void xlog_grant_push_ail( @@ -2167,8 +2166,7 @@ xlog_write_get_more_iclog_space( uint32_t *log_offset, uint32_t len, uint32_t *record_cnt, - uint32_t *data_cnt, - int *contwr) + uint32_t *data_cnt) { struct xlog_in_core *iclog = *iclogp; int error; @@ -2182,8 +2180,8 @@ xlog_write_get_more_iclog_space( if (error) return error; - error = xlog_state_get_iclog_space(log, len, &iclog, - ticket, contwr, log_offset); + error = xlog_state_get_iclog_space(log, len, &iclog, ticket, + log_offset); if (error) return error; *record_cnt = 0; @@ -2207,8 +2205,7 @@ xlog_write_partial( uint32_t *log_offset, uint32_t *len, uint32_t *record_cnt, - uint32_t *data_cnt, - int *contwr) + uint32_t *data_cnt) { struct xlog_in_core *iclog = *iclogp; struct xfs_log_vec *lv = log_vector; @@ -2242,7 +2239,7 @@ xlog_write_partial( sizeof(struct xlog_op_header)) { error = xlog_write_get_more_iclog_space(log, ticket, &iclog, log_offset, *len, record_cnt, - data_cnt, contwr); + data_cnt); if (error) return ERR_PTR(error); ptr = iclog->ic_datap + *log_offset; @@ -2298,7 +2295,7 @@ xlog_write_partial( } error = xlog_write_get_more_iclog_space(log, ticket, &iclog, log_offset, *len, record_cnt, - data_cnt, contwr); + data_cnt); if (error) return ERR_PTR(error); ptr = iclog->ic_datap + *log_offset; @@ -2396,7 +2393,6 @@ xlog_write( { struct xlog_in_core *iclog = NULL; struct xfs_log_vec *lv = log_vector; - int contwr = 0; int record_cnt = 0; int data_cnt = 0; int error = 0; @@ -2410,7 +2406,7 @@ xlog_write( } error = xlog_state_get_iclog_space(log, len, &iclog, ticket, - &contwr, &log_offset); + &log_offset); if (error) return error; @@ -2425,10 +2421,8 @@ xlog_write( * is correctly maintained in the storage media. This will always * fit in the iclog we have been already been passed. */ - if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) { + if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA); - ASSERT(!contwr); - } while (lv) { lv = xlog_write_single(lv, ticket, iclog, &log_offset, @@ -2438,7 +2432,7 @@ xlog_write( ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))); lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset, - &len, &record_cnt, &data_cnt, &contwr); + &len, &record_cnt, &data_cnt); if (IS_ERR_OR_NULL(lv)) { error = PTR_ERR_OR_ZERO(lv); break; @@ -2452,7 +2446,6 @@ xlog_write( * those writes accounted to it. Hence we do not need to update the * iclog with the number of bytes written here. */ - ASSERT(!contwr || XLOG_FORCED_SHUTDOWN(log)); spin_lock(&log->l_icloglock); xlog_state_finish_copy(log, iclog, record_cnt, 0); if (commit_iclog) { @@ -2856,7 +2849,6 @@ xlog_state_get_iclog_space( int len, struct xlog_in_core **iclogp, struct xlog_ticket *ticket, - int *continued_write, int *logoffsetp) { int log_offset; @@ -2932,13 +2924,10 @@ xlog_state_get_iclog_space( * iclogs (to mark it taken), this particular iclog will release/sync * to disk in xlog_write(). */ - if (len <= iclog->ic_size - iclog->ic_offset) { - *continued_write = 0; + if (len <= iclog->ic_size - iclog->ic_offset) iclog->ic_offset += len; - } else { - *continued_write = 1; + else xlog_state_switch_iclogs(log, iclog, iclog->ic_size); - } *iclogp = iclog; ASSERT(iclog->ic_offset <= iclog->ic_size); From patchwork Fri Mar 5 05:11:29 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117745 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6CEC1C433DB for ; Fri, 5 Mar 2021 05:12:56 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4020065014 for ; Fri, 5 Mar 2021 05:12:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229517AbhCEFMz (ORCPT ); Fri, 5 Mar 2021 00:12:55 -0500 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:59333 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229666AbhCEFMP (ORCPT ); Fri, 5 Mar 2021 00:12:15 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id BC20D1040F3C for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00FbpD-80 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000laF-0G for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 31/45] xfs: CIL context doesn't need to count iovecs Date: Fri, 5 Mar 2021 16:11:29 +1100 Message-Id: <20210305051143.182133-32-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=XTp5-oq1X3BrlmPbjW0A:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Now that we account for log opheaders in the log item formatting code, we don't actually use the aggregated count of log iovecs in the CIL for anything. Remove it and the tracking code that calculates it. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 22 ++++++---------------- 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 34abc3bae587..4047f95a0fc4 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -252,22 +252,18 @@ xlog_cil_alloc_shadow_bufs( /* * Prepare the log item for insertion into the CIL. Calculate the difference in - * log space and vectors it will consume, and if it is a new item pin it as - * well. + * log space it will consume, and if it is a new item pin it as well. */ STATIC void xfs_cil_prepare_item( struct xlog *log, struct xfs_log_vec *lv, struct xfs_log_vec *old_lv, - int *diff_len, - int *diff_iovecs) + int *diff_len) { /* Account for the new LV being passed in */ - if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) { + if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) *diff_len += lv->lv_bytes; - *diff_iovecs += lv->lv_niovecs; - } /* * If there is no old LV, this is the first time we've seen the item in @@ -284,7 +280,6 @@ xfs_cil_prepare_item( ASSERT(lv->lv_buf_len != XFS_LOG_VEC_ORDERED); *diff_len -= old_lv->lv_bytes; - *diff_iovecs -= old_lv->lv_niovecs; lv->lv_item->li_lv_shadow = old_lv; } @@ -333,12 +328,10 @@ static void xlog_cil_insert_format_items( struct xlog *log, struct xfs_trans *tp, - int *diff_len, - int *diff_iovecs) + int *diff_len) { struct xfs_log_item *lip; - /* Bail out if we didn't find a log item. */ if (list_empty(&tp->t_items)) { ASSERT(0); @@ -381,7 +374,6 @@ xlog_cil_insert_format_items( * set the item up as though it is a new insertion so * that the space reservation accounting is correct. */ - *diff_iovecs -= lv->lv_niovecs; *diff_len -= lv->lv_bytes; /* Ensure the lv is set up according to ->iop_size */ @@ -406,7 +398,7 @@ xlog_cil_insert_format_items( ASSERT(IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t))); lip->li_ops->iop_format(lip, lv); insert: - xfs_cil_prepare_item(log, lv, old_lv, diff_len, diff_iovecs); + xfs_cil_prepare_item(log, lv, old_lv, diff_len); } } @@ -426,7 +418,6 @@ xlog_cil_insert_items( struct xfs_cil_ctx *ctx = cil->xc_ctx; struct xfs_log_item *lip; int len = 0; - int diff_iovecs = 0; int iclog_space; int iovhdr_res = 0, split_res = 0, ctx_res = 0; @@ -436,7 +427,7 @@ xlog_cil_insert_items( * We can do this safely because the context can't checkpoint until we * are done so it doesn't matter exactly how we update the CIL. */ - xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs); + xlog_cil_insert_format_items(log, tp, &len); spin_lock(&cil->xc_cil_lock); @@ -471,7 +462,6 @@ xlog_cil_insert_items( } tp->t_ticket->t_curr_res -= len; ctx->space_used += len; - ctx->nvecs += diff_iovecs; /* * If we've overrun the reservation, dump the tx details before we move From patchwork Fri Mar 5 05:11:30 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117737 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 206ACC433E6 for ; Fri, 5 Mar 2021 05:12:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 01BAE64FD4 for ; Fri, 5 Mar 2021 05:12:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229500AbhCEFM3 (ORCPT ); Fri, 5 Mar 2021 00:12:29 -0500 Received: from mail110.syd.optusnet.com.au ([211.29.132.97]:52436 "EHLO mail110.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229592AbhCEFMK (ORCPT ); Fri, 5 Mar 2021 00:12:10 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id EC6EF1078D4 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00FbpG-9D for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000laI-1U for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 32/45] xfs: use the CIL space used counter for emptiness checks Date: Fri, 5 Mar 2021 16:11:30 +1100 Message-Id: <20210305051143.182133-33-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=Tu+Yewfh c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=PUXlccLH651rOZYfVoIA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner In the next patches we are going to make the CIL list itself per-cpu, and so we cannot use list_empty() to check is the list is empty. Replace the list_empty() checks with a flag in the CIL to indicate we have committed at least one transaction to the CIL and hence the CIL is not empty. We need this flag to be an atomic so that we can clear it without holding any locks in the commit fast path, but we also need to be careful to avoid atomic operations in the fast path. Hence we use the fact that test_bit() is not an atomic op to first check if the flag is set and then run the atomic test_and_clear_bit() operation to clear it and steal the initial unit reservation for the CIL context checkpoint. When we are switching to a new context in a push, we place the setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. THis allows all the other places that need to check whether the CIL is empty to use test_bit() and still be serialised correctly with the CIL context swaps that set the bit. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_log_cil.c | 49 +++++++++++++++++++++++-------------------- fs/xfs/xfs_log_priv.h | 4 ++++ 2 files changed, 30 insertions(+), 23 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 4047f95a0fc4..e6e36488f0c7 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -70,6 +70,7 @@ xlog_cil_ctx_switch( struct xfs_cil *cil, struct xfs_cil_ctx *ctx) { + set_bit(XLOG_CIL_EMPTY, &cil->xc_flags); ctx->sequence = ++cil->xc_current_sequence; ctx->cil = cil; cil->xc_ctx = ctx; @@ -436,13 +437,12 @@ xlog_cil_insert_items( list_splice_init(&tp->t_busy, &ctx->busy_extents); /* - * Now transfer enough transaction reservation to the context ticket - * for the checkpoint. The context ticket is special - the unit - * reservation has to grow as well as the current reservation as we - * steal from tickets so we can correctly determine the space used - * during the transaction commit. + * We need to take the CIL checkpoint unit reservation on the first + * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't + * unnecessarily do an atomic op in the fast path here. */ - if (ctx->ticket->t_curr_res == 0) { + if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) && + test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) { ctx_res = ctx->ticket->t_unit_res; ctx->ticket->t_curr_res = ctx_res; tp->t_ticket->t_curr_res -= ctx_res; @@ -771,7 +771,7 @@ xlog_cil_push_work( * move on to a new sequence number and so we have to be able to push * this sequence again later. */ - if (list_empty(&cil->xc_cil)) { + if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) { cil->xc_push_seq = 0; spin_unlock(&cil->xc_push_lock); goto out_skip; @@ -1019,9 +1019,10 @@ xlog_cil_push_background( /* * The cil won't be empty because we are called while holding the - * context lock so whatever we added to the CIL will still be there + * context lock so whatever we added to the CIL will still be there. */ ASSERT(!list_empty(&cil->xc_cil)); + ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)); /* * Don't do a background push if we haven't used up all the @@ -1108,7 +1109,8 @@ xlog_cil_push_now( * there's no work we need to do. */ spin_lock(&cil->xc_push_lock); - if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) { + if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) || + push_seq <= cil->xc_push_seq) { spin_unlock(&cil->xc_push_lock); return; } @@ -1128,7 +1130,7 @@ xlog_cil_empty( bool empty = false; spin_lock(&cil->xc_push_lock); - if (list_empty(&cil->xc_cil)) + if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) empty = true; spin_unlock(&cil->xc_push_lock); return empty; @@ -1289,7 +1291,7 @@ xlog_cil_force_seq( * we would have found the context on the committing list. */ if (sequence == cil->xc_current_sequence && - !list_empty(&cil->xc_cil)) { + !test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) { spin_unlock(&cil->xc_push_lock); goto restart; } @@ -1320,21 +1322,19 @@ xlog_cil_force_seq( */ bool xfs_log_item_in_current_chkpt( - struct xfs_log_item *lip) + struct xfs_log_item *lip) { - struct xfs_cil_ctx *ctx; + struct xfs_cil *cil = lip->li_mountp->m_log->l_cilp; - if (list_empty(&lip->li_cil)) + if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) return false; - ctx = lip->li_mountp->m_log->l_cilp->xc_ctx; - /* * li_seq is written on the first commit of a log item to record the * first checkpoint it is written to. Hence if it is different to the * current sequence, we're in a new checkpoint. */ - if (XFS_LSN_CMP(lip->li_seq, ctx->sequence) != 0) + if (XFS_LSN_CMP(lip->li_seq, cil->xc_ctx->sequence) != 0) return false; return true; } @@ -1373,13 +1373,16 @@ void xlog_cil_destroy( struct xlog *log) { - if (log->l_cilp->xc_ctx) { - if (log->l_cilp->xc_ctx->ticket) - xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket); - kmem_free(log->l_cilp->xc_ctx); + struct xfs_cil *cil = log->l_cilp; + + if (cil->xc_ctx) { + if (cil->xc_ctx->ticket) + xfs_log_ticket_put(cil->xc_ctx->ticket); + kmem_free(cil->xc_ctx); } - ASSERT(list_empty(&log->l_cilp->xc_cil)); - kmem_free(log->l_cilp); + ASSERT(list_empty(&cil->xc_cil)); + ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)); + kmem_free(cil); } diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 003c11653955..b0dc3bc9de59 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -248,6 +248,7 @@ struct xfs_cil_ctx { */ struct xfs_cil { struct xlog *xc_log; + unsigned long xc_flags; struct list_head xc_cil; spinlock_t xc_cil_lock; @@ -263,6 +264,9 @@ struct xfs_cil { wait_queue_head_t xc_push_wait; /* background push throttle */ } ____cacheline_aligned_in_smp; +/* xc_flags bit values */ +#define XLOG_CIL_EMPTY 1 + /* * The amount of log space we allow the CIL to aggregate is difficult to size. * Whatever we choose, we have to make sure we can get a reservation for the From patchwork Fri Mar 5 05:11:31 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117705 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3FF45C4332D for ; Fri, 5 Mar 2021 05:12:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1CA5F65013 for ; Fri, 5 Mar 2021 05:12:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229551AbhCEFMR (ORCPT ); Fri, 5 Mar 2021 00:12:17 -0500 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:36898 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229564AbhCEFL6 (ORCPT ); Fri, 5 Mar 2021 00:11:58 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id EE28BECB13E for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00FbpI-A7 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000laL-2c for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 33/45] xfs: lift init CIL reservation out of xc_cil_lock Date: Fri, 5 Mar 2021 16:11:31 +1100 Message-Id: <20210305051143.182133-34-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=Tu+Yewfh c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=63avsRBh0cCWs8EafugA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner The xc_cil_lock is the most highly contended lock in XFS now. To start the process of getting rid of it, lift the initial reservation of the CIL log space out from under the xc_cil_lock. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_log_cil.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index e6e36488f0c7..50101336a7f4 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -430,23 +430,19 @@ xlog_cil_insert_items( */ xlog_cil_insert_format_items(log, tp, &len); - spin_lock(&cil->xc_cil_lock); - - /* attach the transaction to the CIL if it has any busy extents */ - if (!list_empty(&tp->t_busy)) - list_splice_init(&tp->t_busy, &ctx->busy_extents); - /* * We need to take the CIL checkpoint unit reservation on the first * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't - * unnecessarily do an atomic op in the fast path here. + * unnecessarily do an atomic op in the fast path here. We don't need to + * hold the xc_cil_lock here to clear the XLOG_CIL_EMPTY bit as we are + * under the xc_ctx_lock here and that needs to be held exclusively to + * reset the XLOG_CIL_EMPTY bit. */ if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) && - test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) { + test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) ctx_res = ctx->ticket->t_unit_res; - ctx->ticket->t_curr_res = ctx_res; - tp->t_ticket->t_curr_res -= ctx_res; - } + + spin_lock(&cil->xc_cil_lock); /* do we need space for more log record headers? */ iclog_space = log->l_iclog_size - log->l_iclog_hsize; @@ -456,11 +452,9 @@ xlog_cil_insert_items( /* need to take into account split region headers, too */ split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header); ctx->ticket->t_unit_res += split_res; - ctx->ticket->t_curr_res += split_res; - tp->t_ticket->t_curr_res -= split_res; - ASSERT(tp->t_ticket->t_curr_res >= len); } - tp->t_ticket->t_curr_res -= len; + tp->t_ticket->t_curr_res -= split_res + ctx_res + len; + ctx->ticket->t_curr_res += split_res + ctx_res; ctx->space_used += len; /* @@ -498,6 +492,9 @@ xlog_cil_insert_items( list_move_tail(&lip->li_cil, &cil->xc_cil); } + /* attach the transaction to the CIL if it has any busy extents */ + if (!list_empty(&tp->t_busy)) + list_splice_init(&tp->t_busy, &ctx->busy_extents); spin_unlock(&cil->xc_cil_lock); if (tp->t_ticket->t_curr_res < 0) From patchwork Fri Mar 5 05:11:32 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117727 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2342EC4332E for ; Fri, 5 Mar 2021 05:12:29 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 059C565004 for ; Fri, 5 Mar 2021 05:12:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229599AbhCEFM1 (ORCPT ); Fri, 5 Mar 2021 00:12:27 -0500 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:59335 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229604AbhCEFMD (ORCPT ); Fri, 5 Mar 2021 00:12:03 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id C90E61040FCB for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00FbpM-Ax for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000laO-3P for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 34/45] xfs: rework per-iclog header CIL reservation Date: Fri, 5 Mar 2021 16:11:32 +1100 Message-Id: <20210305051143.182133-35-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=uy_oPLJR2o4LYBG0tAEA:9 a=GhFTH_zsvjk43Kz2:21 a=wsoSsWhywCIb7RL0:21 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner For every iclog that a CIL push will use up, we need to ensure we have space reserved for the iclog header in each iclog. It is extremely difficult to do this accurately with a per-cpu counter without expensive summing of the counter in every commit. However, we know what the maximum CIL size is going to be because of the hard space limit we have, and hence we know exactly how many iclogs we are going to need to write out the CIL. We are constrained by the requirement that small transactions only have reservation space for a single iclog header built into them. At commit time we don't know how much of the current transaction reservation is made up of iclog header reservations as calculated by xfs_log_calc_unit_res() when the ticket was reserved. As larger reservations have multiple header spaces reserved, we can steal more than one iclog header reservation at a time, but we only steal the exact number needed for the given log vector size delta. As a result, we don't know exactly when we are going to steal iclog header reservations, nor do we know exactly how many we are going to need for a given CIL. To make things simple, start by calculating the worst case number of iclog headers a full CIL push will require. Record this into an atomic variable in the CIL. Then add a byte counter to the log ticket that records exactly how much iclog header space has been reserved in this ticket by xfs_log_calc_unit_res(). This tells us exactly how much space we can steal from the ticket at transaction commit time. Now, at transaction commit time, we can check if the CIL has a full iclog header reservation and, if not, steal the entire reservation the current ticket holds for iclog headers. This minimises the number of times we need to do atomic operations in the fast path, but still guarantees we get all the reservations we need. Signed-off-by: Dave Chinner --- fs/xfs/libxfs/xfs_log_rlimit.c | 2 +- fs/xfs/libxfs/xfs_shared.h | 3 +- fs/xfs/xfs_log.c | 12 +++++--- fs/xfs/xfs_log_cil.c | 55 ++++++++++++++++++++++++++-------- fs/xfs/xfs_log_priv.h | 20 +++++++------ 5 files changed, 64 insertions(+), 28 deletions(-) diff --git a/fs/xfs/libxfs/xfs_log_rlimit.c b/fs/xfs/libxfs/xfs_log_rlimit.c index 7f55eb3f3653..75390134346d 100644 --- a/fs/xfs/libxfs/xfs_log_rlimit.c +++ b/fs/xfs/libxfs/xfs_log_rlimit.c @@ -88,7 +88,7 @@ xfs_log_calc_minimum_size( xfs_log_get_max_trans_res(mp, &tres); - max_logres = xfs_log_calc_unit_res(mp, tres.tr_logres); + max_logres = xfs_log_calc_unit_res(mp, tres.tr_logres, NULL); if (tres.tr_logcount > 1) max_logres *= tres.tr_logcount; diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h index 8c61a461bf7b..b4791b817fe3 100644 --- a/fs/xfs/libxfs/xfs_shared.h +++ b/fs/xfs/libxfs/xfs_shared.h @@ -48,7 +48,8 @@ extern const struct xfs_buf_ops xfs_symlink_buf_ops; extern const struct xfs_buf_ops xfs_rtbuf_ops; /* log size calculation functions */ -int xfs_log_calc_unit_res(struct xfs_mount *mp, int unit_bytes); +int xfs_log_calc_unit_res(struct xfs_mount *mp, int unit_bytes, + int *niclogs); int xfs_log_calc_minimum_size(struct xfs_mount *); struct xfs_trans_res; diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 8f4f7ae84358..46a006d41184 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -3312,7 +3312,8 @@ xfs_log_ticket_get( static int xlog_calc_unit_res( struct xlog *log, - int unit_bytes) + int unit_bytes, + int *niclogs) { int iclog_space; uint num_headers; @@ -3392,15 +3393,18 @@ xlog_calc_unit_res( /* roundoff padding for transaction data and one for commit record */ unit_bytes += 2 * log->l_iclog_roundoff; + if (niclogs) + *niclogs = num_headers; return unit_bytes; } int xfs_log_calc_unit_res( struct xfs_mount *mp, - int unit_bytes) + int unit_bytes, + int *niclogs) { - return xlog_calc_unit_res(mp->m_log, unit_bytes); + return xlog_calc_unit_res(mp->m_log, unit_bytes, niclogs); } /* @@ -3418,7 +3422,7 @@ xlog_ticket_alloc( tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL); - unit_res = xlog_calc_unit_res(log, unit_bytes); + unit_res = xlog_calc_unit_res(log, unit_bytes, &tic->t_iclog_hdrs); atomic_set(&tic->t_ref, 1); tic->t_task = current; diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 50101336a7f4..f8fb2f59e24c 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -44,9 +44,20 @@ xlog_cil_ticket_alloc( * transaction overhead reservation from the first transaction commit. */ tic->t_curr_res = 0; + tic->t_iclog_hdrs = 0; return tic; } +static inline void +xlog_cil_set_iclog_hdr_count(struct xfs_cil *cil) +{ + struct xlog *log = cil->xc_log; + + atomic_set(&cil->xc_iclog_hdrs, + (XLOG_CIL_BLOCKING_SPACE_LIMIT(log) / + (log->l_iclog_size - log->l_iclog_hsize))); +} + /* * Unavoidable forward declaration - xlog_cil_push_work() calls * xlog_cil_ctx_alloc() itself. @@ -70,6 +81,7 @@ xlog_cil_ctx_switch( struct xfs_cil *cil, struct xfs_cil_ctx *ctx) { + xlog_cil_set_iclog_hdr_count(cil); set_bit(XLOG_CIL_EMPTY, &cil->xc_flags); ctx->sequence = ++cil->xc_current_sequence; ctx->cil = cil; @@ -92,6 +104,7 @@ xlog_cil_init_post_recovery( { log->l_cilp->xc_ctx->ticket = xlog_cil_ticket_alloc(log); log->l_cilp->xc_ctx->sequence = 1; + xlog_cil_set_iclog_hdr_count(log->l_cilp); } static inline int @@ -419,7 +432,6 @@ xlog_cil_insert_items( struct xfs_cil_ctx *ctx = cil->xc_ctx; struct xfs_log_item *lip; int len = 0; - int iclog_space; int iovhdr_res = 0, split_res = 0, ctx_res = 0; ASSERT(tp); @@ -442,19 +454,36 @@ xlog_cil_insert_items( test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) ctx_res = ctx->ticket->t_unit_res; - spin_lock(&cil->xc_cil_lock); - - /* do we need space for more log record headers? */ - iclog_space = log->l_iclog_size - log->l_iclog_hsize; - if (len > 0 && (ctx->space_used / iclog_space != - (ctx->space_used + len) / iclog_space)) { - split_res = (len + iclog_space - 1) / iclog_space; - /* need to take into account split region headers, too */ - split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header); - ctx->ticket->t_unit_res += split_res; + /* + * Check if we need to steal iclog headers. atomic_read() is not a + * locked atomic operation, so we can check the value before we do any + * real atomic ops in the fast path. If we've already taken the CIL unit + * reservation from this commit, we've already got one iclog header + * space reserved so we have to account for that otherwise we risk + * overrunning the reservation on this ticket. + * + * If the CIL is already at the hard limit, we might need more header + * space that originally reserved. So steal more header space from every + * commit that occurs once we are over the hard limit to ensure the CIL + * push won't run out of reservation space. + * + * This can steal more than we need, but that's OK. + */ + if (atomic_read(&cil->xc_iclog_hdrs) > 0 || + ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { + int split_res = log->l_iclog_hsize + + sizeof(struct xlog_op_header); + if (ctx_res) + ctx_res += split_res * (tp->t_ticket->t_iclog_hdrs - 1); + else + ctx_res = split_res * tp->t_ticket->t_iclog_hdrs; + atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs); } - tp->t_ticket->t_curr_res -= split_res + ctx_res + len; - ctx->ticket->t_curr_res += split_res + ctx_res; + + spin_lock(&cil->xc_cil_lock); + tp->t_ticket->t_curr_res -= ctx_res + len; + ctx->ticket->t_unit_res += ctx_res; + ctx->ticket->t_curr_res += ctx_res; ctx->space_used += len; /* diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index b0dc3bc9de59..e72d14c76e03 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -140,15 +140,16 @@ enum xlog_iclog_state { #define XLOG_TIC_LEN_MAX 15 typedef struct xlog_ticket { - struct list_head t_queue; /* reserve/write queue */ - struct task_struct *t_task; /* task that owns this ticket */ - xlog_tid_t t_tid; /* transaction identifier : 4 */ - atomic_t t_ref; /* ticket reference count : 4 */ - int t_curr_res; /* current reservation in bytes : 4 */ - int t_unit_res; /* unit reservation in bytes : 4 */ - char t_ocnt; /* original count : 1 */ - char t_cnt; /* current count : 1 */ - char t_flags; /* properties of reservation : 1 */ + struct list_head t_queue; /* reserve/write queue */ + struct task_struct *t_task; /* task that owns this ticket */ + xlog_tid_t t_tid; /* transaction identifier */ + atomic_t t_ref; /* ticket reference count */ + int t_curr_res; /* current reservation */ + int t_unit_res; /* unit reservation */ + char t_ocnt; /* original count */ + char t_cnt; /* current count */ + char t_flags; /* properties of reservation */ + int t_iclog_hdrs; /* iclog hdrs in t_curr_res */ } xlog_ticket_t; /* @@ -249,6 +250,7 @@ struct xfs_cil_ctx { struct xfs_cil { struct xlog *xc_log; unsigned long xc_flags; + atomic_t xc_iclog_hdrs; struct list_head xc_cil; spinlock_t xc_cil_lock; From patchwork Fri Mar 5 05:11:33 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117721 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B00FAC433DB for ; Fri, 5 Mar 2021 05:12:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 90DA465013 for ; Fri, 5 Mar 2021 05:12:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229562AbhCEFM0 (ORCPT ); Fri, 5 Mar 2021 00:12:26 -0500 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:59337 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229599AbhCEFMD (ORCPT ); Fri, 5 Mar 2021 00:12:03 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id C85671040F7B for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00FbpP-CP for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000laR-4P for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 35/45] xfs: introduce per-cpu CIL tracking sructure Date: Fri, 5 Mar 2021 16:11:33 +1100 Message-Id: <20210305051143.182133-36-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=Eqg3tRyrYyYRg9lh8iMA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner The CIL push lock is highly contended on larger machines, becoming a hard bottleneck that about 700,000 transaction commits/s on >16p machines. To address this, start moving the CIL tracking infrastructure to utilise per-CPU structures. We need to track the space used, the amount of log reservation space reserved to write the CIL, the log items in the CIL and the busy extents that need to be completed by the CIL commit. This requires a couple of per-cpu counters, an unordered per-cpu list and a globally ordered per-cpu list. Create a per-cpu structure to hold these and all the management interfaces needed, as well as the hooks to handle hotplug CPUs. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 94 ++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_log_priv.h | 15 ++++++ include/linux/cpuhotplug.h | 1 + 3 files changed, 110 insertions(+) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index f8fb2f59e24c..1bcf0d423d30 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -1365,6 +1365,93 @@ xfs_log_item_in_current_chkpt( return true; } +#ifdef CONFIG_HOTPLUG_CPU +static LIST_HEAD(xlog_cil_pcp_list); +static DEFINE_SPINLOCK(xlog_cil_pcp_lock); +static bool xlog_cil_pcp_init; + +static int +xlog_cil_pcp_dead( + unsigned int cpu) +{ + struct xfs_cil *cil; + + spin_lock(&xlog_cil_pcp_lock); + list_for_each_entry(cil, &xlog_cil_pcp_list, xc_pcp_list) { + /* move stuff on dead CPU to context */ + } + spin_unlock(&xlog_cil_pcp_lock); + return 0; +} + +static int +xlog_cil_pcp_hpadd( + struct xfs_cil *cil) +{ + if (!xlog_cil_pcp_init) { + int ret; + ret = cpuhp_setup_state_nocalls(CPUHP_XFS_CIL_DEAD, + "xfs/cil_pcp:dead", NULL, + xlog_cil_pcp_dead); + if (ret < 0) { + xfs_warn(cil->xc_log->l_mp, + "Failed to initialise CIL hotplug, error %d. XFS is non-functional.", + ret); + ASSERT(0); + return -ENOMEM; + } + xlog_cil_pcp_init = true; + } + + INIT_LIST_HEAD(&cil->xc_pcp_list); + spin_lock(&xlog_cil_pcp_lock); + list_add(&cil->xc_pcp_list, &xlog_cil_pcp_list); + spin_unlock(&xlog_cil_pcp_lock); + return 0; +} + +static void +xlog_cil_pcp_hpremove( + struct xfs_cil *cil) +{ + spin_lock(&xlog_cil_pcp_lock); + list_del(&cil->xc_pcp_list); + spin_unlock(&xlog_cil_pcp_lock); +} + +#else /* !CONFIG_HOTPLUG_CPU */ +static inline void xlog_cil_pcp_hpadd(struct xfs_cil *cil) {} +static inline void xlog_cil_pcp_hpremove(struct xfs_cil *cil) {} +#endif + +static void __percpu * +xlog_cil_pcp_alloc( + struct xfs_cil *cil) +{ + struct xlog_cil_pcp *cilpcp; + + cilpcp = alloc_percpu(struct xlog_cil_pcp); + if (!cilpcp) + return NULL; + + if (xlog_cil_pcp_hpadd(cil) < 0) { + free_percpu(cilpcp); + return NULL; + } + return cilpcp; +} + +static void +xlog_cil_pcp_free( + struct xfs_cil *cil, + struct xlog_cil_pcp *cilpcp) +{ + if (!cilpcp) + return; + xlog_cil_pcp_hpremove(cil); + free_percpu(cilpcp); +} + /* * Perform initial CIL structure initialisation. */ @@ -1379,6 +1466,12 @@ xlog_cil_init( if (!cil) return -ENOMEM; + cil->xc_pcp = xlog_cil_pcp_alloc(cil); + if (!cil->xc_pcp) { + kmem_free(cil); + return -ENOMEM; + } + INIT_LIST_HEAD(&cil->xc_cil); INIT_LIST_HEAD(&cil->xc_committing); spin_lock_init(&cil->xc_cil_lock); @@ -1409,6 +1502,7 @@ xlog_cil_destroy( ASSERT(list_empty(&cil->xc_cil)); ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)); + xlog_cil_pcp_free(cil, cil->xc_pcp); kmem_free(cil); } diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index e72d14c76e03..2562f29c8986 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -231,6 +231,16 @@ struct xfs_cil_ctx { struct work_struct push_work; }; +/* + * Per-cpu CIL tracking items + */ +struct xlog_cil_pcp { + uint32_t space_used; + uint32_t curr_res; + struct list_head busy_extents; + struct list_head log_items; +}; + /* * Committed Item List structure * @@ -264,6 +274,11 @@ struct xfs_cil { wait_queue_head_t xc_commit_wait; xfs_csn_t xc_current_sequence; wait_queue_head_t xc_push_wait; /* background push throttle */ + + struct xlog_cil_pcp __percpu *xc_pcp; +#ifdef CONFIG_HOTPLUG_CPU + struct list_head xc_pcp_list; +#endif } ____cacheline_aligned_in_smp; /* xc_flags bit values */ diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index f14adb882338..b13b21d825b3 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -52,6 +52,7 @@ enum cpuhp_state { CPUHP_FS_BUFF_DEAD, CPUHP_PRINTK_DEAD, CPUHP_MM_MEMCQ_DEAD, + CPUHP_XFS_CIL_DEAD, CPUHP_PERCPU_CNT_DEAD, CPUHP_RADIX_DEAD, CPUHP_PAGE_ALLOC_DEAD, From patchwork Fri Mar 5 05:11:34 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117735 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 99602C43381 for ; Fri, 5 Mar 2021 05:12:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7BD1765015 for ; Fri, 5 Mar 2021 05:12:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229512AbhCEFMy (ORCPT ); Fri, 5 Mar 2021 00:12:54 -0500 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:41256 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229642AbhCEFMN (ORCPT ); Fri, 5 Mar 2021 00:12:13 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 0501D78AFF3 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00FbpR-D9 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000laU-5a for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 36/45] xfs: implement percpu cil space used calculation Date: Fri, 5 Mar 2021 16:11:34 +1100 Message-Id: <20210305051143.182133-37-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=YFFqHMdf3F2HtVUQwpcA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Now that we have the CIL percpu structures in place, implement the space used counter with a fast sum check similar to the percpu_counter infrastructure. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 42 ++++++++++++++++++++++++++++++++++++------ fs/xfs/xfs_log_priv.h | 2 +- 2 files changed, 37 insertions(+), 7 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 1bcf0d423d30..5519d112c1fd 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -433,6 +433,8 @@ xlog_cil_insert_items( struct xfs_log_item *lip; int len = 0; int iovhdr_res = 0, split_res = 0, ctx_res = 0; + int space_used; + struct xlog_cil_pcp *cilpcp; ASSERT(tp); @@ -469,8 +471,9 @@ xlog_cil_insert_items( * * This can steal more than we need, but that's OK. */ + space_used = atomic_read(&ctx->space_used); if (atomic_read(&cil->xc_iclog_hdrs) > 0 || - ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { + space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { int split_res = log->l_iclog_hsize + sizeof(struct xlog_op_header); if (ctx_res) @@ -480,16 +483,34 @@ xlog_cil_insert_items( atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs); } + /* + * Update the CIL percpu pointer. This updates the global counter when + * over the percpu batch size or when the CIL is over the space limit. + * This means low lock overhead for normal updates, and when over the + * limit the space used is immediately accounted. This makes enforcing + * the hard limit much more accurate. The per cpu fold threshold is + * based on how close we are to the hard limit. + */ + cilpcp = get_cpu_ptr(cil->xc_pcp); + cilpcp->space_used += len; + if (space_used >= XLOG_CIL_SPACE_LIMIT(log) || + cilpcp->space_used > + ((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) / + num_online_cpus())) { + atomic_add(cilpcp->space_used, &ctx->space_used); + cilpcp->space_used = 0; + } + put_cpu_ptr(cilpcp); + spin_lock(&cil->xc_cil_lock); - tp->t_ticket->t_curr_res -= ctx_res + len; ctx->ticket->t_unit_res += ctx_res; ctx->ticket->t_curr_res += ctx_res; - ctx->space_used += len; /* * If we've overrun the reservation, dump the tx details before we move * the log items. Shutdown is imminent... */ + tp->t_ticket->t_curr_res -= ctx_res + len; if (WARN_ON(tp->t_ticket->t_curr_res < 0)) { xfs_warn(log->l_mp, "Transaction log reservation overrun:"); xfs_warn(log->l_mp, @@ -769,12 +790,20 @@ xlog_cil_push_work( struct bio bio; DECLARE_COMPLETION_ONSTACK(bdev_flush); bool commit_iclog_sync = false; + int cpu; + struct xlog_cil_pcp *cilpcp; new_ctx = xlog_cil_ctx_alloc(); new_ctx->ticket = xlog_cil_ticket_alloc(log); down_write(&cil->xc_ctx_lock); + /* Reset the CIL pcp counters */ + for_each_online_cpu(cpu) { + cilpcp = per_cpu_ptr(cil->xc_pcp, cpu); + cilpcp->space_used = 0; + } + spin_lock(&cil->xc_push_lock); push_seq = cil->xc_push_seq; ASSERT(push_seq <= ctx->sequence); @@ -1042,6 +1071,7 @@ xlog_cil_push_background( struct xlog *log) __releases(cil->xc_ctx_lock) { struct xfs_cil *cil = log->l_cilp; + int space_used = atomic_read(&cil->xc_ctx->space_used); /* * The cil won't be empty because we are called while holding the @@ -1054,7 +1084,7 @@ xlog_cil_push_background( * Don't do a background push if we haven't used up all the * space available yet. */ - if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) { + if (space_used < XLOG_CIL_SPACE_LIMIT(log)) { up_read(&cil->xc_ctx_lock); return; } @@ -1083,10 +1113,10 @@ xlog_cil_push_background( * The ctx->xc_push_lock provides the serialisation necessary for safely * using the lockless waitqueue_active() check in this context. */ - if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) || + if (space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) || waitqueue_active(&cil->xc_push_wait)) { trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket); - ASSERT(cil->xc_ctx->space_used < log->l_logsize); + ASSERT(space_used < log->l_logsize); xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock); return; } diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 2562f29c8986..4eb373357f26 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -222,7 +222,7 @@ struct xfs_cil_ctx { xfs_lsn_t commit_lsn; /* chkpt commit record lsn */ struct xlog_ticket *ticket; /* chkpt ticket */ int nvecs; /* number of regions */ - int space_used; /* aggregate size of regions */ + atomic_t space_used; /* aggregate size of regions */ struct list_head busy_extents; /* busy extents in chkpt */ struct xfs_log_vec *lv_chain; /* logvecs being pushed */ struct list_head iclog_entry; From patchwork Fri Mar 5 05:11:35 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117755 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D3BFEC433E9 for ; Fri, 5 Mar 2021 05:12:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 9EDBE65004 for ; Fri, 5 Mar 2021 05:12:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229631AbhCEFM4 (ORCPT ); Fri, 5 Mar 2021 00:12:56 -0500 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:58877 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229552AbhCEFMS (ORCPT ); Fri, 5 Mar 2021 00:12:18 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id E9320827B24 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00FbpV-EK for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000laX-6W for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 37/45] xfs: track CIL ticket reservation in percpu structure Date: Fri, 5 Mar 2021 16:11:35 +1100 Message-Id: <20210305051143.182133-38-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=OVMH6THyl8vKEaUu_IkA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner To get it out from under the cil spinlock. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 11 ++++++----- fs/xfs/xfs_log_priv.h | 2 +- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 5519d112c1fd..a2f93bd7644b 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -492,6 +492,7 @@ xlog_cil_insert_items( * based on how close we are to the hard limit. */ cilpcp = get_cpu_ptr(cil->xc_pcp); + cilpcp->space_reserved += ctx_res; cilpcp->space_used += len; if (space_used >= XLOG_CIL_SPACE_LIMIT(log) || cilpcp->space_used > @@ -502,10 +503,6 @@ xlog_cil_insert_items( } put_cpu_ptr(cilpcp); - spin_lock(&cil->xc_cil_lock); - ctx->ticket->t_unit_res += ctx_res; - ctx->ticket->t_curr_res += ctx_res; - /* * If we've overrun the reservation, dump the tx details before we move * the log items. Shutdown is imminent... @@ -527,6 +524,7 @@ xlog_cil_insert_items( * We do this here so we only need to take the CIL lock once during * the transaction commit. */ + spin_lock(&cil->xc_cil_lock); list_for_each_entry(lip, &tp->t_items, li_trans) { /* Skip items which aren't dirty in this transaction. */ @@ -798,10 +796,13 @@ xlog_cil_push_work( down_write(&cil->xc_ctx_lock); - /* Reset the CIL pcp counters */ + /* Aggregate and reset the CIL pcp counters */ for_each_online_cpu(cpu) { cilpcp = per_cpu_ptr(cil->xc_pcp, cpu); + ctx->ticket->t_curr_res += cilpcp->space_reserved; cilpcp->space_used = 0; + cilpcp->space_reserved = 0; + } spin_lock(&cil->xc_push_lock); diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 4eb373357f26..278b9eaea582 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -236,7 +236,7 @@ struct xfs_cil_ctx { */ struct xlog_cil_pcp { uint32_t space_used; - uint32_t curr_res; + uint32_t space_reserved; struct list_head busy_extents; struct list_head log_items; }; From patchwork Fri Mar 5 05:11:36 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117729 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CE737C433DB for ; Fri, 5 Mar 2021 05:12:53 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 98CA164FD4 for ; Fri, 5 Mar 2021 05:12:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229602AbhCEFM3 (ORCPT ); Fri, 5 Mar 2021 00:12:29 -0500 Received: from mail110.syd.optusnet.com.au ([211.29.132.97]:52434 "EHLO mail110.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229500AbhCEFMK (ORCPT ); Fri, 5 Mar 2021 00:12:10 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id EED2C107B83 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00FbpY-Ey for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000laa-7I for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 38/45] xfs: convert CIL busy extents to per-cpu Date: Fri, 5 Mar 2021 16:11:36 +1100 Message-Id: <20210305051143.182133-39-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=Tu+Yewfh c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=3A6U50GPjzoxoPTa4dIA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner To get them out from under the CIL lock. This is an unordered list, so we can simply punt it to per-cpu lists during transaction commits and reaggregate it back into a single list during the CIL push work. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index a2f93bd7644b..7428b98c8279 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -501,6 +501,9 @@ xlog_cil_insert_items( atomic_add(cilpcp->space_used, &ctx->space_used); cilpcp->space_used = 0; } + /* attach the transaction to the CIL if it has any busy extents */ + if (!list_empty(&tp->t_busy)) + list_splice_init(&tp->t_busy, &cilpcp->busy_extents); put_cpu_ptr(cilpcp); /* @@ -540,9 +543,6 @@ xlog_cil_insert_items( list_move_tail(&lip->li_cil, &cil->xc_cil); } - /* attach the transaction to the CIL if it has any busy extents */ - if (!list_empty(&tp->t_busy)) - list_splice_init(&tp->t_busy, &ctx->busy_extents); spin_unlock(&cil->xc_cil_lock); if (tp->t_ticket->t_curr_res < 0) @@ -802,7 +802,10 @@ xlog_cil_push_work( ctx->ticket->t_curr_res += cilpcp->space_reserved; cilpcp->space_used = 0; cilpcp->space_reserved = 0; - + if (!list_empty(&cilpcp->busy_extents)) { + list_splice_init(&cilpcp->busy_extents, + &ctx->busy_extents); + } } spin_lock(&cil->xc_push_lock); @@ -1459,17 +1462,24 @@ static void __percpu * xlog_cil_pcp_alloc( struct xfs_cil *cil) { + void __percpu *pcptr; struct xlog_cil_pcp *cilpcp; + int cpu; - cilpcp = alloc_percpu(struct xlog_cil_pcp); - if (!cilpcp) + pcptr = alloc_percpu(struct xlog_cil_pcp); + if (!pcptr) return NULL; + for_each_possible_cpu(cpu) { + cilpcp = per_cpu_ptr(pcptr, cpu); + INIT_LIST_HEAD(&cilpcp->busy_extents); + } + if (xlog_cil_pcp_hpadd(cil) < 0) { - free_percpu(cilpcp); + free_percpu(pcptr); return NULL; } - return cilpcp; + return pcptr; } static void From patchwork Fri Mar 5 05:11:37 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117751 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0544EC4321A for ; Fri, 5 Mar 2021 05:12:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D878B65014 for ; Fri, 5 Mar 2021 05:12:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229627AbhCEFM4 (ORCPT ); Fri, 5 Mar 2021 00:12:56 -0500 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:55937 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229687AbhCEFMP (ORCPT ); Fri, 5 Mar 2021 00:12:15 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id E9AED630B5 for ; Fri, 5 Mar 2021 16:11:51 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbpb-Fo for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000lad-8G for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 39/45] xfs: Add order IDs to log items in CIL Date: Fri, 5 Mar 2021 16:11:37 +1100 Message-Id: <20210305051143.182133-40-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=CQLCtQqWn6iW24E7Yj8A:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Before we split the ordered CIL up into per cpu lists, we need a mechanism to track the order of the items in the CIL. We need to do this because there are rules around the order in which related items must physically appear in the log even inside a single checkpoint transaction. An example of this is intents - an intent must appear in the log before it's intent done record so taht log recovery can cancel the intent correctly. If we have these two records misordered in the CIL, then they will not be recovered correctly by journal replay. We also will not be able to move items to the tail of the CIL list when they are relogged, hence the log items will need some mechanism to allow the correct log item order to be recreated before we write log items to the hournal. Hence we need to have a mechanism for recording global order of transactions in the log items so that we can recover that order from un-ordered per-cpu lists. Do this with a simple monotonic increasing commit counter in the CIL context. Each log item in the transaction gets stamped with the current commit order ID before it is added to the CIL. If the item is already in the CIL, leave it where it is instead of moving it to the tail of the list and instead sort the list before we start the push work. XXX: list_sort() under the cil_ctx_lock held exclusive starts hurting that >16 threads. Front end commits are waiting on the push to switch contexts much longer. The item order id should likely be moved into the logvecs when they are detacted from the items, then the sort can be done on the logvec after the cil_ctx_lock has been released. logvecs will need to use a list_head for this rather than a single linked list like they do now.... Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 34 ++++++++++++++++++++++++++-------- fs/xfs/xfs_log_priv.h | 1 + fs/xfs/xfs_trans.h | 1 + 3 files changed, 28 insertions(+), 8 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 7428b98c8279..7420389f4cee 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -434,6 +434,7 @@ xlog_cil_insert_items( int len = 0; int iovhdr_res = 0, split_res = 0, ctx_res = 0; int space_used; + int order; struct xlog_cil_pcp *cilpcp; ASSERT(tp); @@ -523,10 +524,12 @@ xlog_cil_insert_items( } /* - * Now (re-)position everything modified at the tail of the CIL. + * Now update the order of everything modified in the transaction + * and insert items into the CIL if they aren't already there. * We do this here so we only need to take the CIL lock once during * the transaction commit. */ + order = atomic_inc_return(&ctx->order_id); spin_lock(&cil->xc_cil_lock); list_for_each_entry(lip, &tp->t_items, li_trans) { @@ -534,13 +537,10 @@ xlog_cil_insert_items( if (!test_bit(XFS_LI_DIRTY, &lip->li_flags)) continue; - /* - * Only move the item if it isn't already at the tail. This is - * to prevent a transient list_empty() state when reinserting - * an item that is already the only item in the CIL. - */ - if (!list_is_last(&lip->li_cil, &cil->xc_cil)) - list_move_tail(&lip->li_cil, &cil->xc_cil); + lip->li_order_id = order; + if (!list_empty(&lip->li_cil)) + continue; + list_add(&lip->li_cil, &cil->xc_cil); } spin_unlock(&cil->xc_cil_lock); @@ -753,6 +753,22 @@ xlog_cil_build_trans_hdr( tic->t_curr_res -= lvhdr->lv_bytes; } +static int +xlog_cil_order_cmp( + void *priv, + struct list_head *a, + struct list_head *b) +{ + struct xfs_log_item *l1 = container_of(a, struct xfs_log_item, li_cil); + struct xfs_log_item *l2 = container_of(b, struct xfs_log_item, li_cil); + + if (l1->li_order_id > l2->li_order_id) + return 1; + if (l1->li_order_id < l2->li_order_id) + return -1; + return 0; +} + /* * Push the Committed Item List to the log. * @@ -891,6 +907,7 @@ xlog_cil_push_work( * needed on the transaction commit side which is currently locked out * by the flush lock. */ + list_sort(NULL, &cil->xc_cil, xlog_cil_order_cmp); lv = NULL; while (!list_empty(&cil->xc_cil)) { struct xfs_log_item *item; @@ -898,6 +915,7 @@ xlog_cil_push_work( item = list_first_entry(&cil->xc_cil, struct xfs_log_item, li_cil); list_del_init(&item->li_cil); + item->li_order_id = 0; if (!ctx->lv_chain) ctx->lv_chain = item->li_lv; else diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 278b9eaea582..92d9e1a03a07 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -229,6 +229,7 @@ struct xfs_cil_ctx { struct list_head committing; /* ctx committing list */ struct work_struct discard_endio_work; struct work_struct push_work; + atomic_t order_id; }; /* diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index 6276c7d251e6..226c0f5e7870 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -44,6 +44,7 @@ struct xfs_log_item { struct xfs_log_vec *li_lv; /* active log vector */ struct xfs_log_vec *li_lv_shadow; /* standby vector */ xfs_csn_t li_seq; /* CIL commit seq */ + uint32_t li_order_id; /* CIL commit order */ }; /* From patchwork Fri Mar 5 05:11:38 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117759 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AE9A9C432C3 for ; Fri, 5 Mar 2021 05:12:56 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8B70D65013 for ; Fri, 5 Mar 2021 05:12:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229616AbhCEFMz (ORCPT ); Fri, 5 Mar 2021 00:12:55 -0500 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:56791 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229688AbhCEFMQ (ORCPT ); Fri, 5 Mar 2021 00:12:16 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 0FF9062F93 for ; Fri, 5 Mar 2021 16:11:52 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbpe-Gq for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000lag-9C for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 40/45] xfs: convert CIL to unordered per cpu lists Date: Fri, 5 Mar 2021 16:11:38 +1100 Message-Id: <20210305051143.182133-41-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=Tu+Yewfh c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=LeOBFmjP2KjtsCjRVtEA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner So that we can remove the cil_lock which is a global serialisation point. We've already got ordering sorted, so all we need to do is treat the CIL list like the busy extent list and reconstruct it before the push starts. This is what we're trying to avoid: - 75.35% 1.83% [kernel] [k] xfs_log_commit_cil - 46.35% xfs_log_commit_cil - 41.54% _raw_spin_lock - 67.30% do_raw_spin_lock 66.96% __pv_queued_spin_lock_slowpath Which happens on a 32p system when running a 32-way 'rm -rf' workload. After this patch: - 20.90% 3.23% [kernel] [k] xfs_log_commit_cil - 17.67% xfs_log_commit_cil - 6.51% xfs_log_ticket_ungrant 1.40% xfs_log_space_wake 2.32% memcpy_erms - 2.18% xfs_buf_item_committing - 2.12% xfs_buf_item_release - 1.03% xfs_buf_unlock 0.96% up 0.72% xfs_buf_rele 1.33% xfs_inode_item_format 1.19% down_read 0.91% up_read 0.76% xfs_buf_item_format - 0.68% kmem_alloc_large - 0.67% kmem_alloc 0.64% __kmalloc 0.50% xfs_buf_item_size It kinda looks like the workload is running out of log space all the time. But all the spinlock contention is gone and the transaction commit rate has gone from 800k/s to 1.3M/s so the amount of real work being done has gone up a *lot*. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 61 ++++++++++++++++++++----------------------- fs/xfs/xfs_log_priv.h | 2 -- 2 files changed, 29 insertions(+), 34 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 7420389f4cee..3d43a5088154 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -448,10 +448,9 @@ xlog_cil_insert_items( /* * We need to take the CIL checkpoint unit reservation on the first * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't - * unnecessarily do an atomic op in the fast path here. We don't need to - * hold the xc_cil_lock here to clear the XLOG_CIL_EMPTY bit as we are - * under the xc_ctx_lock here and that needs to be held exclusively to - * reset the XLOG_CIL_EMPTY bit. + * unnecessarily do an atomic op in the fast path here. We can clear the + * XLOG_CIL_EMPTY bit as we are under the xc_ctx_lock here and that + * needs to be held exclusively to reset the XLOG_CIL_EMPTY bit. */ if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) && test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) @@ -505,24 +504,6 @@ xlog_cil_insert_items( /* attach the transaction to the CIL if it has any busy extents */ if (!list_empty(&tp->t_busy)) list_splice_init(&tp->t_busy, &cilpcp->busy_extents); - put_cpu_ptr(cilpcp); - - /* - * If we've overrun the reservation, dump the tx details before we move - * the log items. Shutdown is imminent... - */ - tp->t_ticket->t_curr_res -= ctx_res + len; - if (WARN_ON(tp->t_ticket->t_curr_res < 0)) { - xfs_warn(log->l_mp, "Transaction log reservation overrun:"); - xfs_warn(log->l_mp, - " log items: %d bytes (iov hdrs: %d bytes)", - len, iovhdr_res); - xfs_warn(log->l_mp, " split region headers: %d bytes", - split_res); - xfs_warn(log->l_mp, " ctx ticket: %d bytes", ctx_res); - xlog_print_trans(tp); - } - /* * Now update the order of everything modified in the transaction * and insert items into the CIL if they aren't already there. @@ -530,7 +511,6 @@ xlog_cil_insert_items( * the transaction commit. */ order = atomic_inc_return(&ctx->order_id); - spin_lock(&cil->xc_cil_lock); list_for_each_entry(lip, &tp->t_items, li_trans) { /* Skip items which aren't dirty in this transaction. */ @@ -540,10 +520,26 @@ xlog_cil_insert_items( lip->li_order_id = order; if (!list_empty(&lip->li_cil)) continue; - list_add(&lip->li_cil, &cil->xc_cil); + list_add(&lip->li_cil, &cilpcp->log_items); + } + put_cpu_ptr(cilpcp); + + /* + * If we've overrun the reservation, dump the tx details before we move + * the log items. Shutdown is imminent... + */ + tp->t_ticket->t_curr_res -= ctx_res + len; + if (WARN_ON(tp->t_ticket->t_curr_res < 0)) { + xfs_warn(log->l_mp, "Transaction log reservation overrun:"); + xfs_warn(log->l_mp, + " log items: %d bytes (iov hdrs: %d bytes)", + len, iovhdr_res); + xfs_warn(log->l_mp, " split region headers: %d bytes", + split_res); + xfs_warn(log->l_mp, " ctx ticket: %d bytes", ctx_res); + xlog_print_trans(tp); } - spin_unlock(&cil->xc_cil_lock); if (tp->t_ticket->t_curr_res < 0) xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); @@ -806,6 +802,7 @@ xlog_cil_push_work( bool commit_iclog_sync = false; int cpu; struct xlog_cil_pcp *cilpcp; + LIST_HEAD (log_items); new_ctx = xlog_cil_ctx_alloc(); new_ctx->ticket = xlog_cil_ticket_alloc(log); @@ -822,6 +819,9 @@ xlog_cil_push_work( list_splice_init(&cilpcp->busy_extents, &ctx->busy_extents); } + if (!list_empty(&cilpcp->log_items)) { + list_splice_init(&cilpcp->log_items, &log_items); + } } spin_lock(&cil->xc_push_lock); @@ -907,12 +907,12 @@ xlog_cil_push_work( * needed on the transaction commit side which is currently locked out * by the flush lock. */ - list_sort(NULL, &cil->xc_cil, xlog_cil_order_cmp); + list_sort(NULL, &log_items, xlog_cil_order_cmp); lv = NULL; - while (!list_empty(&cil->xc_cil)) { + while (!list_empty(&log_items)) { struct xfs_log_item *item; - item = list_first_entry(&cil->xc_cil, + item = list_first_entry(&log_items, struct xfs_log_item, li_cil); list_del_init(&item->li_cil); item->li_order_id = 0; @@ -1099,7 +1099,6 @@ xlog_cil_push_background( * The cil won't be empty because we are called while holding the * context lock so whatever we added to the CIL will still be there. */ - ASSERT(!list_empty(&cil->xc_cil)); ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)); /* @@ -1491,6 +1490,7 @@ xlog_cil_pcp_alloc( for_each_possible_cpu(cpu) { cilpcp = per_cpu_ptr(pcptr, cpu); INIT_LIST_HEAD(&cilpcp->busy_extents); + INIT_LIST_HEAD(&cilpcp->log_items); } if (xlog_cil_pcp_hpadd(cil) < 0) { @@ -1531,9 +1531,7 @@ xlog_cil_init( return -ENOMEM; } - INIT_LIST_HEAD(&cil->xc_cil); INIT_LIST_HEAD(&cil->xc_committing); - spin_lock_init(&cil->xc_cil_lock); spin_lock_init(&cil->xc_push_lock); init_waitqueue_head(&cil->xc_push_wait); init_rwsem(&cil->xc_ctx_lock); @@ -1559,7 +1557,6 @@ xlog_cil_destroy( kmem_free(cil->xc_ctx); } - ASSERT(list_empty(&cil->xc_cil)); ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)); xlog_cil_pcp_free(cil, cil->xc_pcp); kmem_free(cil); diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 92d9e1a03a07..12a1a36eef7e 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -262,8 +262,6 @@ struct xfs_cil { struct xlog *xc_log; unsigned long xc_flags; atomic_t xc_iclog_hdrs; - struct list_head xc_cil; - spinlock_t xc_cil_lock; struct rw_semaphore xc_ctx_lock ____cacheline_aligned_in_smp; struct xfs_cil_ctx *xc_ctx; From patchwork Fri Mar 5 05:11:39 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117763 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,TVD_PH_BODY_ACCOUNTS_PRE, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 03942C4360C for ; Fri, 5 Mar 2021 05:12:58 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C6A9065013 for ; Fri, 5 Mar 2021 05:12:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229552AbhCEFM5 (ORCPT ); Fri, 5 Mar 2021 00:12:57 -0500 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:58881 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229578AbhCEFMT (ORCPT ); Fri, 5 Mar 2021 00:12:19 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 09A1D828113 for ; Fri, 5 Mar 2021 16:11:52 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbpg-Hz for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000laj-AE for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 41/45] xfs: move CIL ordering to the logvec chain Date: Fri, 5 Mar 2021 16:11:39 +1100 Message-Id: <20210305051143.182133-42-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=xvtivy5LyCy7Q2jPmfEA:9 a=efn6qEVgltjbn7IR:21 a=EvgdL0HIMYCb13tF:21 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Adding a list_sort() call to the CIL push work while the xc_ctx_lock is held exclusively has resulted in fairly long lock hold times and that stops all front end transaction commits from making progress. We can move the sorting out of the xc_ctx_lock if we can transfer the ordering information to the log vectors as they are detached from the log items and then we can sort the log vectors. This requires log vectors to use a list_head rather than a single linked list and to hold an order ID field. With these changes, we can move the list_sort() call to just before we call xlog_write() when we aren't holding any locks at all. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 46 +++++++++++++++++++++--------- fs/xfs/xfs_log.h | 3 +- fs/xfs/xfs_log_cil.c | 63 +++++++++++++++++++++++++---------------- fs/xfs/xfs_log_priv.h | 4 +-- fs/xfs/xfs_trans.c | 4 +-- fs/xfs/xfs_trans_priv.h | 4 +-- 6 files changed, 78 insertions(+), 46 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 46a006d41184..fd58c3213ebf 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -846,6 +846,9 @@ xlog_write_unmount_record( .lv_niovecs = 1, .lv_iovecp = ®, }; + LIST_HEAD(lv_chain); + INIT_LIST_HEAD(&vec.lv_chain); + list_add(&vec.lv_chain, &lv_chain); /* account for space used by record data */ ticket->t_curr_res -= sizeof(unmount_rec); @@ -857,8 +860,8 @@ xlog_write_unmount_record( */ if (log->l_targ != log->l_mp->m_ddev_targp) blkdev_issue_flush(log->l_targ->bt_bdev); - return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS, - reg.i_len); + return xlog_write(log, &lv_chain, ticket, NULL, NULL, + XLOG_UNMOUNT_TRANS, reg.i_len); } /* @@ -1571,14 +1574,17 @@ xlog_commit_record( .lv_iovecp = ®, }; int error; + LIST_HEAD(lv_chain); + INIT_LIST_HEAD(&vec.lv_chain); + list_add(&vec.lv_chain, &lv_chain); if (XLOG_FORCED_SHUTDOWN(log)) return -EIO; /* account for space used by record data */ ticket->t_curr_res -= reg.i_len; - error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS, - reg.i_len); + error = xlog_write(log, &lv_chain, ticket, lsn, iclog, + XLOG_COMMIT_TRANS, reg.i_len); if (error) xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); return error; @@ -2109,6 +2115,7 @@ xlog_print_trans( */ static struct xfs_log_vec * xlog_write_single( + struct list_head *lv_chain, struct xfs_log_vec *log_vector, struct xlog_ticket *ticket, struct xlog_in_core *iclog, @@ -2117,7 +2124,7 @@ xlog_write_single( uint32_t *record_cnt, uint32_t *data_cnt) { - struct xfs_log_vec *lv = log_vector; + struct xfs_log_vec *lv; void *ptr; int index; @@ -2125,10 +2132,13 @@ xlog_write_single( iclog->ic_state == XLOG_STATE_WANT_SYNC); ptr = iclog->ic_datap + *log_offset; - for (lv = log_vector; lv; lv = lv->lv_next) { + for (lv = log_vector; + !list_entry_is_head(lv, lv_chain, lv_chain); + lv = list_next_entry(lv, lv_chain)) { /* - * If the entire log vec does not fit in the iclog, punt it to - * the partial copy loop which can handle this case. + * If the log vec contains data that needs to be copied and does + * not entirely fit in the iclog, punt it to the partial copy + * loop which can handle this case. */ if (lv->lv_niovecs && lv->lv_bytes > iclog->ic_size - *log_offset) @@ -2154,6 +2164,8 @@ xlog_write_single( *data_cnt += reg->i_len; } } + if (list_entry_is_head(lv, lv_chain, lv_chain)) + lv = NULL; ASSERT(*len == 0 || lv); return lv; } @@ -2199,6 +2211,7 @@ xlog_write_get_more_iclog_space( static struct xfs_log_vec * xlog_write_partial( struct xlog *log, + struct list_head *lv_chain, struct xfs_log_vec *log_vector, struct xlog_ticket *ticket, struct xlog_in_core **iclogp, @@ -2338,7 +2351,10 @@ xlog_write_partial( * the caller so it can go back to fast path copying. */ *iclogp = iclog; - return lv->lv_next; + lv = list_next_entry(lv, lv_chain); + if (list_entry_is_head(lv, lv_chain, lv_chain)) + return NULL; + return lv; } /* @@ -2384,7 +2400,7 @@ xlog_write_partial( int xlog_write( struct xlog *log, - struct xfs_log_vec *log_vector, + struct list_head *lv_chain, struct xlog_ticket *ticket, xfs_lsn_t *start_lsn, struct xlog_in_core **commit_iclog, @@ -2392,7 +2408,7 @@ xlog_write( uint32_t len) { struct xlog_in_core *iclog = NULL; - struct xfs_log_vec *lv = log_vector; + struct xfs_log_vec *lv; int record_cnt = 0; int data_cnt = 0; int error = 0; @@ -2424,15 +2440,17 @@ xlog_write( if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA); + lv = list_first_entry_or_null(lv_chain, struct xfs_log_vec, lv_chain); while (lv) { - lv = xlog_write_single(lv, ticket, iclog, &log_offset, + lv = xlog_write_single(lv_chain, lv, ticket, iclog, &log_offset, &len, &record_cnt, &data_cnt); if (!lv) break; ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))); - lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset, - &len, &record_cnt, &data_cnt); + lv = xlog_write_partial(log, lv_chain, lv, ticket, &iclog, + &log_offset, &len, &record_cnt, + &data_cnt); if (IS_ERR_OR_NULL(lv)) { error = PTR_ERR_OR_ZERO(lv); break; diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index af54ea3f8c90..0445dd6acbce 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -9,7 +9,8 @@ struct xfs_cil_ctx; struct xfs_log_vec { - struct xfs_log_vec *lv_next; /* next lv in build list */ + struct list_head lv_chain; /* lv chain ptrs */ + int lv_order_id; /* chain ordering info */ int lv_niovecs; /* number of iovecs in lv */ struct xfs_log_iovec *lv_iovecp; /* iovec array */ struct xfs_log_item *lv_item; /* owner */ diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 3d43a5088154..6dcc23829bef 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -72,6 +72,7 @@ xlog_cil_ctx_alloc(void) ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS); INIT_LIST_HEAD(&ctx->committing); INIT_LIST_HEAD(&ctx->busy_extents); + INIT_LIST_HEAD(&ctx->lv_chain); INIT_WORK(&ctx->push_work, xlog_cil_push_work); return ctx; } @@ -237,6 +238,7 @@ xlog_cil_alloc_shadow_bufs( lv = kmem_alloc_large(buf_size, KM_NOFS); memset(lv, 0, xlog_cil_iovec_space(niovecs)); + INIT_LIST_HEAD(&lv->lv_chain); lv->lv_item = lip; lv->lv_size = buf_size; if (ordered) @@ -252,7 +254,6 @@ xlog_cil_alloc_shadow_bufs( else lv->lv_buf_len = 0; lv->lv_bytes = 0; - lv->lv_next = NULL; } /* Ensure the lv is set up according to ->iop_size */ @@ -379,8 +380,6 @@ xlog_cil_insert_format_items( if (lip->li_lv && shadow->lv_size <= lip->li_lv->lv_size) { /* same or smaller, optimise common overwrite case */ lv = lip->li_lv; - lv->lv_next = NULL; - if (ordered) goto insert; @@ -547,14 +546,14 @@ xlog_cil_insert_items( static void xlog_cil_free_logvec( - struct xfs_log_vec *log_vector) + struct list_head *lv_chain) { struct xfs_log_vec *lv; - for (lv = log_vector; lv; ) { - struct xfs_log_vec *next = lv->lv_next; + while(!list_empty(lv_chain)) { + lv = list_first_entry(lv_chain, struct xfs_log_vec, lv_chain); + list_del_init(&lv->lv_chain); kmem_free(lv); - lv = next; } } @@ -653,7 +652,7 @@ xlog_cil_committed( spin_unlock(&ctx->cil->xc_push_lock); } - xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain, + xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, &ctx->lv_chain, ctx->start_lsn, abort); xfs_extent_busy_sort(&ctx->busy_extents); @@ -664,7 +663,7 @@ xlog_cil_committed( list_del(&ctx->committing); spin_unlock(&ctx->cil->xc_push_lock); - xlog_cil_free_logvec(ctx->lv_chain); + xlog_cil_free_logvec(&ctx->lv_chain); if (!list_empty(&ctx->busy_extents)) xlog_discard_busy_extents(mp, ctx); @@ -744,7 +743,7 @@ xlog_cil_build_trans_hdr( lvhdr->lv_niovecs = 2; lvhdr->lv_iovecp = &hdr->lhdr[0]; lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len; - lvhdr->lv_next = ctx->lv_chain; + list_add(&lvhdr->lv_chain, &ctx->lv_chain); tic->t_curr_res -= lvhdr->lv_bytes; } @@ -755,12 +754,14 @@ xlog_cil_order_cmp( struct list_head *a, struct list_head *b) { - struct xfs_log_item *l1 = container_of(a, struct xfs_log_item, li_cil); - struct xfs_log_item *l2 = container_of(b, struct xfs_log_item, li_cil); + struct xfs_log_vec *l1 = container_of(a, struct xfs_log_vec, + lv_chain); + struct xfs_log_vec *l2 = container_of(b, struct xfs_log_vec, + lv_chain); - if (l1->li_order_id > l2->li_order_id) + if (l1->lv_order_id > l2->lv_order_id) return 1; - if (l1->li_order_id < l2->li_order_id) + if (l1->lv_order_id < l2->lv_order_id) return -1; return 0; } @@ -907,26 +908,25 @@ xlog_cil_push_work( * needed on the transaction commit side which is currently locked out * by the flush lock. */ - list_sort(NULL, &log_items, xlog_cil_order_cmp); lv = NULL; while (!list_empty(&log_items)) { struct xfs_log_item *item; item = list_first_entry(&log_items, struct xfs_log_item, li_cil); - list_del_init(&item->li_cil); - item->li_order_id = 0; - if (!ctx->lv_chain) - ctx->lv_chain = item->li_lv; - else - lv->lv_next = item->li_lv; + lv = item->li_lv; - item->li_lv = NULL; + lv->lv_order_id = item->li_order_id; num_iovecs += lv->lv_niovecs; - /* we don't write ordered log vectors */ if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) num_bytes += lv->lv_bytes; + list_add_tail(&lv->lv_chain, &ctx->lv_chain); + + list_del_init(&item->li_cil); + item->li_order_id = 0; + item->li_lv = NULL; + } /* @@ -959,6 +959,13 @@ xlog_cil_push_work( spin_unlock(&cil->xc_push_lock); up_write(&cil->xc_ctx_lock); + /* + * Sort the log vector chain before we add the transaction headers. + * This ensures we always have the transaction headers at the start + * of the chain. + */ + list_sort(NULL, &ctx->lv_chain, xlog_cil_order_cmp); + /* * Build a checkpoint transaction header and write it to the log to * begin the transaction. We need to account for the space used by the @@ -981,8 +988,14 @@ xlog_cil_push_work( * use the commit record lsn then we can move the tail beyond the grant * write head. */ - error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL, - XLOG_START_TRANS, num_bytes); + error = xlog_write(log, &ctx->lv_chain, ctx->ticket, &ctx->start_lsn, + NULL, XLOG_START_TRANS, num_bytes); + + /* + * Take the lvhdr back off the lv_chain as it should not be passed + * to log IO completion. + */ + list_del(&lvhdr.lv_chain); if (error) goto out_abort_free_ticket; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 12a1a36eef7e..6a4160200417 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -224,7 +224,7 @@ struct xfs_cil_ctx { int nvecs; /* number of regions */ atomic_t space_used; /* aggregate size of regions */ struct list_head busy_extents; /* busy extents in chkpt */ - struct xfs_log_vec *lv_chain; /* logvecs being pushed */ + struct list_head lv_chain; /* logvecs being pushed */ struct list_head iclog_entry; struct list_head committing; /* ctx committing list */ struct work_struct discard_endio_work; @@ -480,7 +480,7 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes) void xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket); void xlog_print_trans(struct xfs_trans *); -int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector, +int xlog_write(struct xlog *log, struct list_head *lv_chain, struct xlog_ticket *tic, xfs_lsn_t *start_lsn, struct xlog_in_core **commit_iclog, uint optype, uint32_t len); int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 83c2b7f22eb7..b20e68279808 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -747,7 +747,7 @@ xfs_log_item_batch_insert( void xfs_trans_committed_bulk( struct xfs_ail *ailp, - struct xfs_log_vec *log_vector, + struct list_head *lv_chain, xfs_lsn_t commit_lsn, bool aborted) { @@ -762,7 +762,7 @@ xfs_trans_committed_bulk( spin_unlock(&ailp->ail_lock); /* unpin all the log items */ - for (lv = log_vector; lv; lv = lv->lv_next ) { + list_for_each_entry(lv, lv_chain, lv_chain) { struct xfs_log_item *lip = lv->lv_item; xfs_lsn_t item_lsn; diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 3004aeac9110..b0bf78e6ff76 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -18,8 +18,8 @@ void xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *); void xfs_trans_del_item(struct xfs_log_item *); void xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp); -void xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv, - xfs_lsn_t commit_lsn, bool aborted); +void xfs_trans_committed_bulk(struct xfs_ail *ailp, + struct list_head *lv_chain, xfs_lsn_t commit_lsn, bool aborted); /* * AIL traversal cursor. * From patchwork Fri Mar 5 05:11:40 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117687 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AD8CCC433E0 for ; Fri, 5 Mar 2021 05:12:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8A9AF64FD4 for ; Fri, 5 Mar 2021 05:12:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229629AbhCEFMC (ORCPT ); Fri, 5 Mar 2021 00:12:02 -0500 Received: from mail110.syd.optusnet.com.au ([211.29.132.97]:51834 "EHLO mail110.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229520AbhCEFL4 (ORCPT ); Fri, 5 Mar 2021 00:11:56 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id 12FA4107890 for ; Fri, 5 Mar 2021 16:11:52 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbpk-It for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000lam-BL for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 42/45] xfs: __percpu_counter_compare() inode count debug too expensive Date: Fri, 5 Mar 2021 16:11:40 +1100 Message-Id: <20210305051143.182133-43-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=sMu_4l5kZS4VzaOAmzIA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner - 21.92% __xfs_trans_commit - 21.62% xfs_log_commit_cil - 11.69% xfs_trans_unreserve_and_mod_sb - 11.58% __percpu_counter_compare - 11.45% __percpu_counter_sum - 10.29% _raw_spin_lock_irqsave - 10.28% do_raw_spin_lock __pv_queued_spin_lock_slowpath We debated just getting rid of it last time this came up and there was no real objection to removing it. Now it's the biggest scalability limitation for debug kernels even on smallish machines, so let's just get rid of it. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_trans.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index b20e68279808..637d084c8aa8 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -616,19 +616,12 @@ xfs_trans_unreserve_and_mod_sb( ASSERT(!error); } - if (idelta) { + if (idelta) percpu_counter_add_batch(&mp->m_icount, idelta, XFS_ICOUNT_BATCH); - if (idelta < 0) - ASSERT(__percpu_counter_compare(&mp->m_icount, 0, - XFS_ICOUNT_BATCH) >= 0); - } - if (ifreedelta) { + if (ifreedelta) percpu_counter_add(&mp->m_ifree, ifreedelta); - if (ifreedelta < 0) - ASSERT(percpu_counter_compare(&mp->m_ifree, 0) >= 0); - } if (rtxdelta == 0 && !(tp->t_flags & XFS_TRANS_SB_DIRTY)) return; From patchwork Fri Mar 5 05:11:41 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117709 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7D460C433E6 for ; Fri, 5 Mar 2021 05:12:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5C32F65014 for ; Fri, 5 Mar 2021 05:12:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229465AbhCEFMS (ORCPT ); Fri, 5 Mar 2021 00:12:18 -0500 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:59331 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229595AbhCEFMA (ORCPT ); Fri, 5 Mar 2021 00:12:00 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 13DBE1041374 for ; Fri, 5 Mar 2021 16:11:52 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbpm-Jf for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000lap-C5 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 43/45] xfs: avoid cil push lock if possible Date: Fri, 5 Mar 2021 16:11:41 +1100 Message-Id: <20210305051143.182133-44-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=Tu+Yewfh c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=F6O72S5bnyPC6Df_giEA:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Because now it hurts when the CIL fills up. - 37.20% __xfs_trans_commit - 35.84% xfs_log_commit_cil - 19.34% _raw_spin_lock - do_raw_spin_lock 19.01% __pv_queued_spin_lock_slowpath - 4.20% xfs_log_ticket_ungrant 0.90% xfs_log_space_wake Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 6dcc23829bef..d60c72ad391a 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -1115,10 +1115,18 @@ xlog_cil_push_background( ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)); /* - * Don't do a background push if we haven't used up all the - * space available yet. + * We are done if: + * - we haven't used up all the space available yet; or + * - we've already queued up a push; and + * - we're not over the hard limit; and + * - nothing has been over the hard limit. + * + * If so, we don't need to take the push lock as there's nothing to do. */ - if (space_used < XLOG_CIL_SPACE_LIMIT(log)) { + if (space_used < XLOG_CIL_SPACE_LIMIT(log) || + (cil->xc_push_seq == cil->xc_current_sequence && + space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log) && + !waitqueue_active(&cil->xc_push_wait))) { up_read(&cil->xc_ctx_lock); return; } From patchwork Fri Mar 5 05:11:42 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117761 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6F3F1C43381 for ; Fri, 5 Mar 2021 05:12:58 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4461365014 for ; Fri, 5 Mar 2021 05:12:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229578AbhCEFM5 (ORCPT ); Fri, 5 Mar 2021 00:12:57 -0500 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:32828 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229564AbhCEFMS (ORCPT ); Fri, 5 Mar 2021 00:12:18 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 5F7518278C6 for ; Fri, 5 Mar 2021 16:11:52 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbpq-Ke for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000las-Cw for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 44/45] xfs: xlog_sync() manually adjusts grant head space Date: Fri, 5 Mar 2021 16:11:42 +1100 Message-Id: <20210305051143.182133-45-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=vhzcWsXmtM9S7ZmMwX4A:9 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner When xlog_sync() rounds off the tail the iclog that is being flushed, it manually subtracts that space from the grant heads. This space is actually reserved by the transaction ticket that covers the xlog_sync() call from xlog_write(), but we don't plumb the ticket down far enough for it to account for the space consumed in the current log ticket. The grant heads are hot, so we really should be accounting this to the ticket is we can, rather than adding thousands of extra grant head updates every CIL commit. Interestingly, this actually indicates a potential log space overrun can occur when we force the log. By the time that xfs_log_force() pushes out an active iclog and consumes the roundoff space, the reservation for that roundoff space has been returned to the grant heads and is no longer covered by a reservation. In theory the roundoff added to log force on an already full log could push the write head past the tail. In practice, the CIL commit that writes to the log and needs the iclog pushed will have reserved space for roundoff, so when it releases the ticket there will still be physical space for the roundoff to be committed to the log, even though it is no longer reserved. This roundoff won't be enough space to allow a transaction to be woken if the log is full, so overruns should not actually occur in practice. That said, it indicates that we should not release the CIL context log ticket until after we've released the commit iclog. It also means that xlog_sync() still needs the direct grant head manipulation if we don't provide it with a ticket. Log forces are rare when we are in fast paths running 1.5 million transactions/s that make the grant heads hot, so let's optimise the hot case and pass CIL log tickets down to the xlog_sync() code. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_log.c | 39 +++++++++++++++++++++++++-------------- fs/xfs/xfs_log_cil.c | 19 ++++++++++++++----- fs/xfs/xfs_log_priv.h | 3 ++- 3 files changed, 41 insertions(+), 20 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index fd58c3213ebf..1c7d522b12cd 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -55,7 +55,8 @@ xlog_grant_push_ail( STATIC void xlog_sync( struct xlog *log, - struct xlog_in_core *iclog); + struct xlog_in_core *iclog, + struct xlog_ticket *ticket); #if defined(DEBUG) STATIC void xlog_verify_dest_ptr( @@ -535,7 +536,8 @@ __xlog_state_release_iclog( int xlog_state_release_iclog( struct xlog *log, - struct xlog_in_core *iclog) + struct xlog_in_core *iclog, + struct xlog_ticket *ticket) { lockdep_assert_held(&log->l_icloglock); @@ -545,7 +547,7 @@ xlog_state_release_iclog( if (atomic_dec_and_test(&iclog->ic_refcnt) && __xlog_state_release_iclog(log, iclog)) { spin_unlock(&log->l_icloglock); - xlog_sync(log, iclog); + xlog_sync(log, iclog, ticket); spin_lock(&log->l_icloglock); } @@ -898,7 +900,7 @@ xlog_unmount_write( else ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC || iclog->ic_state == XLOG_STATE_IOERROR); - error = xlog_state_release_iclog(log, iclog); + error = xlog_state_release_iclog(log, iclog, tic); xlog_wait_on_iclog(iclog); if (tic) { @@ -1930,7 +1932,8 @@ xlog_calc_iclog_size( STATIC void xlog_sync( struct xlog *log, - struct xlog_in_core *iclog) + struct xlog_in_core *iclog, + struct xlog_ticket *ticket) { unsigned int count; /* byte count of bwrite */ unsigned int roundoff; /* roundoff to BB or stripe */ @@ -1941,12 +1944,20 @@ xlog_sync( count = xlog_calc_iclog_size(log, iclog, &roundoff); - /* move grant heads by roundoff in sync */ - xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff); - xlog_grant_add_space(log, &log->l_write_head.grant, roundoff); + /* + * If we have a ticket, account for the roundoff via the ticket + * reservation to avoid touching the hot grant heads needlessly. + * Otherwise, we have to move grant heads directly. + */ + if (ticket) { + ticket->t_curr_res -= roundoff; + } else { + xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff); + xlog_grant_add_space(log, &log->l_write_head.grant, roundoff); + } /* put cycle number in every block */ - xlog_pack_data(log, iclog, roundoff); + xlog_pack_data(log, iclog, roundoff); /* real byte length */ size = iclog->ic_offset; @@ -2187,7 +2198,7 @@ xlog_write_get_more_iclog_space( xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt); ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC || iclog->ic_state == XLOG_STATE_IOERROR); - error = xlog_state_release_iclog(log, iclog); + error = xlog_state_release_iclog(log, iclog, ticket); spin_unlock(&log->l_icloglock); if (error) return error; @@ -2470,7 +2481,7 @@ xlog_write( ASSERT(optype & XLOG_COMMIT_TRANS); *commit_iclog = iclog; } else { - error = xlog_state_release_iclog(log, iclog); + error = xlog_state_release_iclog(log, iclog, ticket); } spin_unlock(&log->l_icloglock); @@ -2929,7 +2940,7 @@ xlog_state_get_iclog_space( * reference to the iclog. */ if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1)) - error = xlog_state_release_iclog(log, iclog); + error = xlog_state_release_iclog(log, iclog, ticket); spin_unlock(&log->l_icloglock); if (error) return error; @@ -3157,7 +3168,7 @@ xfs_log_force( atomic_inc(&iclog->ic_refcnt); lsn = be64_to_cpu(iclog->ic_header.h_lsn); xlog_state_switch_iclogs(log, iclog, 0); - if (xlog_state_release_iclog(log, iclog)) + if (xlog_state_release_iclog(log, iclog, NULL)) goto out_error; if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) @@ -3250,7 +3261,7 @@ xlog_force_lsn( } atomic_inc(&iclog->ic_refcnt); xlog_state_switch_iclogs(log, iclog, 0); - if (xlog_state_release_iclog(log, iclog)) + if (xlog_state_release_iclog(log, iclog, NULL)) goto out_error; if (log_flushed) *log_flushed = 1; diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index d60c72ad391a..aef60f19ab05 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -804,6 +804,7 @@ xlog_cil_push_work( int cpu; struct xlog_cil_pcp *cilpcp; LIST_HEAD (log_items); + struct xlog_ticket *ticket; new_ctx = xlog_cil_ctx_alloc(); new_ctx->ticket = xlog_cil_ticket_alloc(log); @@ -1037,12 +1038,10 @@ xlog_cil_push_work( if (error) goto out_abort_free_ticket; - xfs_log_ticket_ungrant(log, ctx->ticket); - spin_lock(&commit_iclog->ic_callback_lock); if (commit_iclog->ic_state == XLOG_STATE_IOERROR) { spin_unlock(&commit_iclog->ic_callback_lock); - goto out_abort; + goto out_abort_free_ticket; } ASSERT_ALWAYS(commit_iclog->ic_state == XLOG_STATE_ACTIVE || commit_iclog->ic_state == XLOG_STATE_WANT_SYNC); @@ -1073,12 +1072,23 @@ xlog_cil_push_work( commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH; } + /* + * Pull the ticket off the ctx so we can ungrant it after releasing the + * commit_iclog. The ctx may be freed by the time we return from + * releasing the commit_iclog (i.e. checkpoint has been completed and + * callback run) so we can't reference the ctx after the call to + * xlog_state_release_iclog(). + */ + ticket = ctx->ticket; + /* release the hounds! */ spin_lock(&log->l_icloglock); if (commit_iclog_sync && commit_iclog->ic_state == XLOG_STATE_ACTIVE) xlog_state_switch_iclogs(log, commit_iclog, 0); - xlog_state_release_iclog(log, commit_iclog); + xlog_state_release_iclog(log, commit_iclog, ticket); spin_unlock(&log->l_icloglock); + + xfs_log_ticket_ungrant(log, ticket); return; out_skip: @@ -1089,7 +1099,6 @@ xlog_cil_push_work( out_abort_free_ticket: xfs_log_ticket_ungrant(log, ctx->ticket); -out_abort: ASSERT(XLOG_FORCED_SHUTDOWN(log)); xlog_cil_committed(ctx); } diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 6a4160200417..3d43d3940757 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -487,7 +487,8 @@ int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, struct xlog_in_core **iclog, xfs_lsn_t *lsn); void xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog, int eventual_size); -int xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog); +int xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog, + struct xlog_ticket *ticket); void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket); void xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket); From patchwork Fri Mar 5 05:11:43 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 12117753 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 28891C43333 for ; Fri, 5 Mar 2021 05:12:56 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 03DDC65015 for ; Fri, 5 Mar 2021 05:12:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229609AbhCEFMz (ORCPT ); Fri, 5 Mar 2021 00:12:55 -0500 Received: from mail110.syd.optusnet.com.au ([211.29.132.97]:52666 "EHLO mail110.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229517AbhCEFMP (ORCPT ); Fri, 5 Mar 2021 00:12:15 -0500 Received: from dread.disaster.area (pa49-179-130-210.pa.nsw.optusnet.com.au [49.179.130.210]) by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id 7992F10432A for ; Fri, 5 Mar 2021 16:11:52 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1lI2kh-00Fbpt-M8 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 Received: from dave by discord.disaster.area with local (Exim 4.94) (envelope-from ) id 1lI2kh-000lav-E3 for linux-xfs@vger.kernel.org; Fri, 05 Mar 2021 16:11:51 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 45/45] xfs: expanding delayed logging design with background material Date: Fri, 5 Mar 2021 16:11:43 +1100 Message-Id: <20210305051143.182133-46-david@fromorbit.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210305051143.182133-1-david@fromorbit.com> References: <20210305051143.182133-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=Tu+Yewfh c=1 sm=1 tr=0 cx=a_idp_d a=JD06eNgDs9tuHP7JIKoLzw==:117 a=JD06eNgDs9tuHP7JIKoLzw==:17 a=dESyimp9J3IA:10 a=20KFwNOVAAAA:8 a=4d_rCvqidBtsbfniY4IA:9 a=dDgocgMmBh4IpvIW:21 a=_mjfLlQ2BSInTqtN:21 a=7sonQtKGC4y59yaF:21 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner I wrote up a description of how transactions, space reservations and relogging work together in response to a question for background material on the delayed logging design. Add this to the existing document for ease of future reference. Signed-off-by: Dave Chinner --- .../xfs-delayed-logging-design.rst | 362 ++++++++++++++++-- 1 file changed, 323 insertions(+), 39 deletions(-) diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs-delayed-logging-design.rst index 464405d2801e..e02235911ff3 100644 --- a/Documentation/filesystems/xfs-delayed-logging-design.rst +++ b/Documentation/filesystems/xfs-delayed-logging-design.rst @@ -1,29 +1,315 @@ .. SPDX-License-Identifier: GPL-2.0 -========================== -XFS Delayed Logging Design -========================== - -Introduction to Re-logging in XFS -================================= - -XFS logging is a combination of logical and physical logging. Some objects, -such as inodes and dquots, are logged in logical format where the details -logged are made up of the changes to in-core structures rather than on-disk -structures. Other objects - typically buffers - have their physical changes -logged. The reason for these differences is to reduce the amount of log space -required for objects that are frequently logged. Some parts of inodes are more -frequently logged than others, and inodes are typically more frequently logged -than any other object (except maybe the superblock buffer) so keeping the -amount of metadata logged low is of prime importance. - -The reason that this is such a concern is that XFS allows multiple separate -modifications to a single object to be carried in the log at any given time. -This allows the log to avoid needing to flush each change to disk before -recording a new change to the object. XFS does this via a method called -"re-logging". Conceptually, this is quite simple - all it requires is that any -new change to the object is recorded with a *new copy* of all the existing -changes in the new transaction that is written to the log. +================== +XFS Logging Design +================== + +Preamble +======== + +This document describes the design and algorithms that the XFS journalling +subsystem is based on. While originally focussed on just the design of +the delayed logging extension introduced in 2010, it assumed the reader already +had a fair amount of in-depth knowledge about how XFS transactions are formed +and executed. It also largely omitted any details of how journal space +reservations are accounted for to ensure the operation of the logging subsystem +is guaranteed to be deadlock-free. + +Much of the original document is retained unmodified because it is still valid +and correct. It also allows the new background material to avoid long +descriptions for various algorithms because they are already well documented +in the original document (e.g. what "relogging" is and why it is needed). + +Hence we first start with an overview of transactions, followed by the way +transaction reservations are structured and accounted, +and then move into how we guarantee forwards progress for long running +transactions with finite initial reservations bounds. At this point we need +to explain how relogging works, and that is where the original document +started. + +Introduction +============ + +XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata +are atomic and recoverable. For reasons of space and time efficiency, the +logging mechanisms are varied and complex, combining intents, logical and +physical logging mechanisms to provide the necessary recovery guarantees the +filesystem requires. + +Some objects, such as inodes and dquots, are logged in logical format where the +details logged are made up of the changes to in-core structures rather than +on-disk structures. Other objects - typically buffers - have their physical +changes logged. And long running atomic modifications have individual changes +chained together by intents, ensuring that journal recovery can restart and +finish an operation that was only partially done when the system stopped +functioning. + +The reason for these differences is to keep the amount of log space and CPU time +required to process objects being modified as small as possible and hence the +logging overhead as low as possible. Some items are very frequently modified, +and some parts of objects are more frequently modified than others, so so +keeping the overhead of metadata logging low is of prime importance. + +The method used to log an item or chain modifications together isn't +particularly important in the scope of this document. It suffices to know that +the method used for logging a particular object or chaining modifications +together are different and are dependent on the object and/or modification being +performed. The logging subsystem only cares that certain specific rules are +followed to guarantee forwards progress and prevent deadlocks. + + +Transactions in XFS +=================== + +XFS has two types of high level transactions, defined by the type of log space +reservation they take. These are known as "one shot" and "permanent" +transactions. Permanent transaction reservations can be used for one-shot +transactions, but one-shot reservations cannot be used for permanent +transactions. Reservations must be matched to the modification taking place. + +In the code, a one-shot transaction pattern looks somewhat like this:: + + tp = xfs_trans_alloc() + + + xfs_trans_commit(tp); + +As items are modified in the transaction, the dirty regions in those items are +Once the transaction is committed, all resources joined to it are released, +along with the remaining unused reservation space that was taken at the +transaction allocation time. + +In contrast, a permanent transaction is made up of multiple linked individual +transactions, and the pattern looks like this:: + + tp = xfs_trans_alloc() + xfs_ilock(ip, XFS_ILOCK_EXCL) + + loop { + xfs_trans_ijoin(tp, 0); + + xfs_trans_log_inode(tp, ip); + xfs_trans_roll(&tp); + } + + xfs_trans_commit(tp); + xfs_iunlock(ip, XFS_ILOCK_EXCL); + +While this might look similar to a one-shot transaction, there is an important +difference: xfs_trans_roll() performs a specific operation that links two +transactions together:: + + ntp = xfs_trans_dup(tp); + xfs_trans_commit(tp); + xfs_log_reserve(ntp); + +This results in a series of "rolling transactions" where the inode is locked +across the entire chain of transactions. Hence while this series of rolling +transactions is running, nothing else can read from or write to the inode and +this provides a mechanism for complex changes to appear atomic from an external +observer's point of view. + +It is important to note that a series of rolling transactions in a permanent +transaction does not form an atomic change in the journal. While each +individual modification is atomic, the chain is *not atomic*. If we crash half +way through, then recovery will only replay up to the last transactional +modification the loop made that was committed to the journal. + + +Transactions are Asynchronous +============================= + +In XFS, all high level transactions are asynchronous by default. This means that +xfs_trans_commit() does not guarantee that the modification has been committed +to stable storage when it returns. Hence when a system crashes, not all the +completed transactions will be replayed during recovery. + +However, the logging subsystem does provide global ordering guarantees, such +that if a specific change is seen after recovery, all metadata modifications +that were committed prior to that change will also be seen. + +This affects long running permanent transactions in that it is not possible to +predict how much of a long running operation will actually be recovered because +there is no guarantee of how much of the operation reached stale storage. Hence +if a long running operation requires multiple transactions to fully complete, +the high level operation must use intents and defered operations to guarantee +recovery can complete the operation once the first modification reached the +journal. + +For single shot operations that need to reach stable storage immediately, or +ensuring that a long running permanent transaction is fully committed once it is +complete, we can explicitly tag a transaction as synchronous. This will trigger +a "log force" to flush the outstanding committed transactions to stable storage +in the journal and wait for that to complete. + +Synchronous transactions are rarely used, however, because they limit logging +throughput to the IO latency limitations of the underlying storage. Instead, we +tend to use log forces to ensure modifications are on stable storage only when +a user operation requires a synchronisation point to occur (e.g. fsync). + + +Transaction Reservations +======================== + +It has been mentioned a number of times now that the logging subsystem needs to +provide a forwards progress guarantee so that no modification ever stalls +because it can't be written to the journal due to a lack of space in the +journal. This is acheived by the transaction reservations that are made when +a transaction is first allocated. For permanent transactions, these reservations +are maintained as part of the transaction rolling mechanism. + +A transaction reservation provides a guarantee that there is physical log space +available to write the modification into the journal before we start making +modifications to objects and items. As such, the reservation needs to be large +enough to take into account the amount of metadata that the change might need to +log in the worst case. This means that if we are modifying a btree in the +transaction, we have to reserve enough space to record a full leaf-to-root split +of the btree. As such, the reservations are quite complex because we have to +take into account all the hidden changes that might occur. + +For example, a user data extent allocation involves allocating an extent from +free space, which modifies the free space trees. That's two btrees. Then +inserting the extent into the inode extent map requires modify another btree, +which might require mor allocation that modifies the free space btrees again. +Then we might have to update reverse mappings, which modifies yet another btree +which might require more space. And so on. Hence the amount of metadata that a +"simple" operation can modify can be quite large. + +This "worst case" calculation provides us with the static "unit reservation" +for the transaction that is calculated at mount time. We must guarantee that the +log has this much space available before the transaction is allowed to proceed +so that when we come to write the dirty metadata into the log we don't run out +of log space half way through the write. + +For one-shot transactions, a single unit space reservation is all that is +required for the transaction to proceed. For permanent transactions, however, we +also have a "log count" that affects the size of the reservation that is to be +made. + +While a permanent transaction can get by with a single unit of space +reservation, it is somewhat inefficient to do this as it requires the +transaction rolling mechanism to re-reserve space on every transaction roll. We +know from the implementation of the permanent transactions how many transaction +rolls are likely for the common modifications that need to be made. + +For example, and inode allocation is typically two transactions - one to +physically allocate a free inode chunk on disk, and another to allocate an inode +from an inode chunk that has free inodes in it. Hence for an inode allocation +transaction, we might set the reservation log count to a value of 2 to indicate +that the common/fast path transaction will commit two linked transactions in a +chain. Each time a permanent transaction rolls, it consumes an entire unit +reservation. + +Hence when the permanent transaction is first allocated, the log space +reservation is increases from a single unit reservation to multiple unit +reservations. That multiple is defined by the reservation log count, and this +means we can roll the transaction multiple times before we have to re-reserve +log space when we roll the transaction. This ensures that the common +modifications we make only need to reserve log space once. + +If the log count for a permanent transation reaches zero, then it needs to +re-reserve physical space in the log. This is somewhat complex, and requires +an understanding of how the log accounts for space that has been reserved. + + +Log Space Accounting +==================== + +The position in the log is typically referred to as a Log Sequence Number (LSN). +The log is circular, so the positions in the log are defined by the combination +of a cycle number - the number of times the log has been overwritten - and the +offset into the log. A LSN carries the cycle in the upper 32 bits and the +offset in the lower 32 bits. The offset is in units of "basic blocks" (512 +bytes). Hence we can do realtively simple LSN based math to keep track of +available space in the log. + +Log space accounting is done via a pair of constructs called "grant heads". The +position of the grant heads is an absolute value, so the amount of space +available in the log is defined by the distance between the position of the +grant head and the current log tail. That is, how much space can be +reserved/consumed before the grant heads would fully wrap the log and overtake +the tail position. + +The first grant head is the "reserve" head. This tracks the byte count of the +reservations currently held by active transactions. It is a purely in-memory +accounting of the space reservation and, as such, actually tracks byte offsets +into the log rather than basic blocks. Hence it technically isn't using LSNs to +represent the log position, but it is still treated like a split {cycle,offset} +tuple for the purposes of tracking reservation space. + +The reserve grant head is used to accurately account for exact transaction +reservations amounts and the exact byte count that modifications actually make +and need to write into the log. The reserve head is used to prevent new +transactions from taking new reservations when the head reaches the current +tail. It will block new reservations in a FIFO queue and as the log tail moves +forward it will wake them in order once sufficient space is available. This FIFO +mechanism ensures no transaction is starved of resources when log space +shortages occur. + +The other grant head is the "write" head. Unlike the reserve head, this grant +head contains an LSN and it tracks the physical space usage in the log. While +this might sound like it is accounting the same state as the reserve grant head +- and it mostly does track exactly the same location as the reserve grant head - +there are critical differences in behaviour between them that provides the +forwards progress guarantees that rolling permanent transactions require. + +These differences when a permanent transaction is rolled and the internal "log +count" reaches zero and the initial set of unit reservations have been +exhausted. At this point, we still require a log space reservation to continue +the next transaction in the sequeunce, but we have none remaining. We cannot +sleep during the transaction commit process waiting for new log space to become +available, as we may end up on the end of the FIFO queue and the items we have +locked while we sleep could end up pinning the tail of the log before there is +enough free space in the log to fulfil all of the pending reservations and +then wake up transaction commit in progress. + +To take a new reservation without sleeping requires us to be able to take a +reservation even if there is no reservation space currently available. That is, +we need to be able to *overcommit* the log reservation space. As has already +been detailed, we cannot overcommit physical log space. However, the reserve +grant head does not track physical space - it only accounts for the amount of +reservations we currently have outstanding. Hence if the reserve head passes +over the tail of the log all it means is that new reservations will be throttled +immediately and remain throttled until the log tail is moved forward far enough +to remove the overcommit and start taking new reservations. In other words, we +can overcommit the reserve head without violating the physical log head and tail +rules. + +As a result, permanent transactions only "regrant" reservation space during +xfs_trans_commit() calls, while the physical log space reservation - tracked by +the write head - is then reserved separately by a call to xfs_log_reserve() +after the commit completes. Once the commit completes, we can sleep waiting for +physical log space to be reserved from the write grant head, but only if one +critical rule has been observed:: + + Code using permanent reservations must always log the items they hold + locked across each transaction they roll in the chain. + +"Re-logging" the locked items on every transaction roll ensures that the items +the transaction chain is rolling are always relocated to the physical head of +the log and so do not pin the tail of the log. If a locked item pins the tail of +the log when we sleep on the write reservation, then we will deadlock the log as +we cannot take the locks needed to write back that item and move the tail of the +log forwards to free up write grant space. Re-logging the locked items avoids +this deadlock and guarantees that the log reservation we are making cannot +self-deadlock. + +If all rolling transactions obey this rule, then they can all make forwards +progress independently because nothing will block the progress of the log +tail moving forwards and hence ensuring that write grant space is always +(eventually) made available to permanent transactions no matter how many times +they roll. + + +Re-logging Explained +==================== + +XFS allows multiple separate modifications to a single object to be carried in +the log at any given time. This allows the log to avoid needing to flush each +change to disk before recording a new change to the object. XFS does this via a +method called "re-logging". Conceptually, this is quite simple - all it requires +is that any new change to the object is recorded with a *new copy* of all the +existing changes in the new transaction that is written to the log. That is, if we have a sequence of changes A through to F, and the object was written to disk after change D, we would see in the log the following series @@ -42,16 +328,13 @@ transaction:: In other words, each time an object is relogged, the new transaction contains the aggregation of all the previous changes currently held only in the log. -This relogging technique also allows objects to be moved forward in the log so -that an object being relogged does not prevent the tail of the log from ever -moving forward. This can be seen in the table above by the changing -(increasing) LSN of each subsequent transaction - the LSN is effectively a -direct encoding of the location in the log of the transaction. +This relogging technique allows objects to be moved forward in the log so that +an object being relogged does not prevent the tail of the log from ever moving +forward. This can be seen in the table above by the changing (increasing) LSN +of each subsequent transaction, and it's the technique that allows us to +implement long-running, multiple-commit permanent transactions. -This relogging is also used to implement long-running, multiple-commit -transactions. These transaction are known as rolling transactions, and require -a special log reservation known as a permanent transaction reservation. A -typical example of a rolling transaction is the removal of extents from an +A typical example of a rolling transaction is the removal of extents from an inode which can only be done at a rate of two extents per transaction because of reservation size limitations. Hence a rolling extent removal transaction keeps relogging the inode and btree buffers as they get modified in each @@ -67,12 +350,13 @@ the log over and over again. Worse is the fact that objects tend to get dirtier as they get relogged, so each subsequent transaction is writing more metadata into the log. -Another feature of the XFS transaction subsystem is that most transactions are -asynchronous. That is, they don't commit to disk until either a log buffer is -filled (a log buffer can hold multiple transactions) or a synchronous operation -forces the log buffers holding the transactions to disk. This means that XFS is -doing aggregation of transactions in memory - batching them, if you like - to -minimise the impact of the log IO on transaction throughput. +It should now also be obvious how relogging and asynchronous transactions go +hand in hand. That is, transactions don't get written to the physical journal +until either a log buffer is filled (a log buffer can hold multiple +transactions) or a synchronous operation forces the log buffers holding the +transactions to disk. This means that XFS is doing aggregation of transactions +in memory - batching them, if you like - to minimise the impact of the log IO on +transaction throughput. The limitation on asynchronous transaction throughput is the number and size of log buffers made available by the log manager. By default there are 8 log