From patchwork Tue Feb 23 03:34:35 2021
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 1/8] xfs: log stripe roundoff is a property of the log
Date: Tue, 23 Feb 2021 14:34:35 +1100
Message-Id: <20210223033442.3267258-2-david@fromorbit.com>
In-Reply-To: <20210223033442.3267258-1-david@fromorbit.com>
References: <20210223033442.3267258-1-david@fromorbit.com>

From: Dave Chinner

We don't need to look at the xfs_mount and superblock every time we need to do an iclog roundoff calculation. The property is fixed for the life of the log, so store the roundoff in the log at mount time and use that everywhere.

On a debug build:

$ size fs/xfs/xfs_log.o.*
   text    data     bss     dec     hex filename
  27360     560       8   27928    6d18 fs/xfs/xfs_log.o.orig
  27219     560       8   27787    6c8b fs/xfs/xfs_log.o.patched

Signed-off-by: Dave Chinner
Reviewed-by: Chandan Babu R
Reviewed-by: Darrick J.
Wong Reviewed-by: Christoph Hellwig Reviewed-by: Brian Foster --- fs/xfs/libxfs/xfs_log_format.h | 3 -- fs/xfs/xfs_log.c | 59 ++++++++++++++-------------------- fs/xfs/xfs_log_priv.h | 2 ++ 3 files changed, 27 insertions(+), 37 deletions(-) diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h index 8bd00da6d2a4..16587219549c 100644 --- a/fs/xfs/libxfs/xfs_log_format.h +++ b/fs/xfs/libxfs/xfs_log_format.h @@ -34,9 +34,6 @@ typedef uint32_t xlog_tid_t; #define XLOG_MIN_RECORD_BSHIFT 14 /* 16384 == 1 << 14 */ #define XLOG_BIG_RECORD_BSHIFT 15 /* 32k == 1 << 15 */ #define XLOG_MAX_RECORD_BSHIFT 18 /* 256k == 1 << 18 */ -#define XLOG_BTOLSUNIT(log, b) (((b)+(log)->l_mp->m_sb.sb_logsunit-1) / \ - (log)->l_mp->m_sb.sb_logsunit) -#define XLOG_LSUNITTOB(log, su) ((su) * (log)->l_mp->m_sb.sb_logsunit) #define XLOG_HEADER_SIZE 512 diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 06041834daa3..fa284f26d10e 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1399,6 +1399,11 @@ xlog_alloc_log( xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0); log->l_curr_cycle = 1; /* 0 is bad since this is initial value */ + if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1) + log->l_iclog_roundoff = mp->m_sb.sb_logsunit; + else + log->l_iclog_roundoff = BBSIZE; + xlog_grant_head_init(&log->l_reserve_head); xlog_grant_head_init(&log->l_write_head); @@ -1852,29 +1857,15 @@ xlog_calc_iclog_size( uint32_t *roundoff) { uint32_t count_init, count; - bool use_lsunit; - - use_lsunit = xfs_sb_version_haslogv2(&log->l_mp->m_sb) && - log->l_mp->m_sb.sb_logsunit > 1; /* Add for LR header */ count_init = log->l_iclog_hsize + iclog->ic_offset; + count = roundup(count_init, log->l_iclog_roundoff); - /* Round out the log write size */ - if (use_lsunit) { - /* we have a v2 stripe unit to use */ - count = XLOG_LSUNITTOB(log, XLOG_BTOLSUNIT(log, count_init)); - } else { - count = BBTOB(BTOBB(count_init)); - } - - ASSERT(count >= count_init); *roundoff = count - count_init; - if (use_lsunit) - ASSERT(*roundoff < log->l_mp->m_sb.sb_logsunit); - else - ASSERT(*roundoff < BBTOB(1)); + ASSERT(count >= count_init); + ASSERT(*roundoff < log->l_iclog_roundoff); return count; } @@ -3149,10 +3140,9 @@ xlog_state_switch_iclogs( log->l_curr_block += BTOBB(eventual_size)+BTOBB(log->l_iclog_hsize); /* Round up to next log-sunit */ - if (xfs_sb_version_haslogv2(&log->l_mp->m_sb) && - log->l_mp->m_sb.sb_logsunit > 1) { - uint32_t sunit_bb = BTOBB(log->l_mp->m_sb.sb_logsunit); - log->l_curr_block = roundup(log->l_curr_block, sunit_bb); + if (log->l_iclog_roundoff > BBSIZE) { + log->l_curr_block = roundup(log->l_curr_block, + BTOBB(log->l_iclog_roundoff)); } if (log->l_curr_block >= log->l_logBBsize) { @@ -3404,12 +3394,11 @@ xfs_log_ticket_get( * Figure out the total log space unit (in bytes) that would be * required for a log ticket. 
*/ -int -xfs_log_calc_unit_res( - struct xfs_mount *mp, +static int +xlog_calc_unit_res( + struct xlog *log, int unit_bytes) { - struct xlog *log = mp->m_log; int iclog_space; uint num_headers; @@ -3485,18 +3474,20 @@ xfs_log_calc_unit_res( /* for commit-rec LR header - note: padding will subsume the ophdr */ unit_bytes += log->l_iclog_hsize; - /* for roundoff padding for transaction data and one for commit record */ - if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1) { - /* log su roundoff */ - unit_bytes += 2 * mp->m_sb.sb_logsunit; - } else { - /* BB roundoff */ - unit_bytes += 2 * BBSIZE; - } + /* roundoff padding for transaction data and one for commit record */ + unit_bytes += 2 * log->l_iclog_roundoff; return unit_bytes; } +int +xfs_log_calc_unit_res( + struct xfs_mount *mp, + int unit_bytes) +{ + return xlog_calc_unit_res(mp->m_log, unit_bytes); +} + /* * Allocate and initialise a new log ticket. */ @@ -3513,7 +3504,7 @@ xlog_ticket_alloc( tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL); - unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes); + unit_res = xlog_calc_unit_res(log, unit_bytes); atomic_set(&tic->t_ref, 1); tic->t_task = current;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 1c6fdbf3d506..037950cf1061 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -436,6 +436,8 @@ struct xlog { #endif /* log recovery lsn tracking (for buffer submission */ xfs_lsn_t l_recovery_lsn; + + uint32_t l_iclog_roundoff;/* padding roundoff */ }; #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \

From patchwork Tue Feb 23 03:34:36 2021
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 2/8] xfs: separate CIL commit record IO
Date: Tue, 23 Feb 2021 14:34:36 +1100
Message-Id: <20210223033442.3267258-3-david@fromorbit.com>
In-Reply-To: <20210223033442.3267258-1-david@fromorbit.com>
References: <20210223033442.3267258-1-david@fromorbit.com>

From: Dave Chinner

To allow for iclog IO device cache flush behaviour to be optimised, we first need to separate out the commit record iclog IO from the rest of the checkpoint so we can wait for the checkpoint IO to complete before we issue the commit record.

This separation is only necessary if the commit record is being written into a different iclog to the start of the checkpoint, as the upcoming cache flushing changes require completion ordering against the other iclogs submitted by the checkpoint. If the entire checkpoint and commit is in the one iclog, then they are both covered by the one set of cache flush primitives on the iclog and hence there is no need to separate them for ordering.

Otherwise, we need to wait for all the previous iclogs to complete so they are ordered correctly and made stable by the REQ_PREFLUSH that the commit record iclog IO issues. This guarantees that if a reader sees the commit record in the journal, they will also see the entire checkpoint that the commit record closes off. This also provides the guarantee that when the commit record IO completes, we can safely unpin all the log items in the checkpoint so they can be written back, because the entire checkpoint is stable in the journal.

Signed-off-by: Dave Chinner
Reviewed-by: Chandan Babu R
Reviewed-by: Darrick J. Wong
---
 fs/xfs/xfs_log.c | 55 +++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_log_cil.c | 7 ++++++ fs/xfs/xfs_log_priv.h | 2 ++ 3 files changed, 64 insertions(+)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index fa284f26d10e..ff26fb46d70f 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -808,6 +808,61 @@ xlog_wait_on_iclog( return 0; } +/* + * Wait on any iclogs that are still flushing in the range of start_lsn to the + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise + * holds no log locks. + * + * We walk backwards through the iclogs to find the iclog with the highest lsn + * in the range that we need to wait for and then wait for it to complete. + * Completion ordering of iclog IOs ensures that all prior iclogs to the + * candidate iclog we need to sleep on have been complete by the time our + * candidate has completed its IO. + * + * Therefore we only need to find the first iclog that isn't clean within the + * span of our flush range. If we come across a clean, newly activated iclog + * with a lsn of 0, it means IO has completed on this iclog and all previous + * iclogs will have been completed prior to this one. Hence finding a newly + * activated iclog indicates that there are no iclogs in the range we need to + * wait on and we are done searching.
+ */ +int +xlog_wait_on_iclog_lsn( + struct xlog_in_core *iclog, + xfs_lsn_t start_lsn) +{ + struct xlog *log = iclog->ic_log; + struct xlog_in_core *prev; + int error = -EIO; + + spin_lock(&log->l_icloglock); + if (XLOG_FORCED_SHUTDOWN(log)) + goto out_unlock; + + error = 0; + for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) { + + /* Done if the lsn is before our start lsn */ + if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn), + start_lsn) < 0) + break; + + /* Don't need to wait on completed, clean iclogs */ + if (prev->ic_state == XLOG_STATE_DIRTY || + prev->ic_state == XLOG_STATE_ACTIVE) { + continue; + } + + /* wait for completion on this iclog */ + xlog_wait(&prev->ic_force_wait, &log->l_icloglock); + return 0; + } + +out_unlock: + spin_unlock(&log->l_icloglock); + return error; +} + /* * Write out an unmount record using the ticket provided. We have to account for * the data space used in the unmount ticket as this write is not done from a diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index b0ef071b3cb5..c5cc1b7ad25e 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -870,6 +870,13 @@ xlog_cil_push_work( wake_up_all(&cil->xc_commit_wait); spin_unlock(&cil->xc_push_lock); + /* + * If the checkpoint spans multiple iclogs, wait for all previous + * iclogs to complete before we submit the commit_iclog. + */ + if (ctx->start_lsn != commit_lsn) + xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn); + /* release the hounds! */ xfs_log_release_iclog(commit_iclog); return; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 037950cf1061..a7ac85aaff4e 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -584,6 +584,8 @@ xlog_wait( remove_wait_queue(wq, &wait); } +int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn); + /* * The LSN is valid so long as it is behind the current LSN. 
If it isn't, this * means that the next log record that includes this metadata could have a

From patchwork Tue Feb 23 03:34:37 2021
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush
Date: Tue, 23 Feb 2021 14:34:37 +1100
Message-Id: <20210223033442.3267258-4-david@fromorbit.com>
In-Reply-To: <20210223033442.3267258-1-david@fromorbit.com>
References: <20210223033442.3267258-1-david@fromorbit.com>

From: Dave Chinner

Move it to xfs_bio_io.c as we are about to add new async cache flush functionality that uses bios directly, so all this stuff should be in the same place. Rename the function to xfs_flush_bdev() to match the xfs_rw_bdev() function that already exists in this file.
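As an illustration, the rename makes call sites pass the block device directly instead of the buftarg wrapper (a sketch of the pattern only; the real hunks are in the diff below):

	/* before: flush through the buftarg wrapper */
	xfs_blkdev_issue_flush(mp->m_ddev_targp);

	/* after: flush the underlying block device directly */
	xfs_flush_bdev(mp->m_ddev_targp->bt_bdev);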
Signed-off-by: Dave Chinner Reviewed-by: Chandan Babu R --- fs/xfs/xfs_bio_io.c | 8 ++++++++ fs/xfs/xfs_buf.c | 2 +- fs/xfs/xfs_file.c | 6 +++--- fs/xfs/xfs_linux.h | 1 + fs/xfs/xfs_log.c | 2 +- fs/xfs/xfs_super.c | 7 ------- fs/xfs/xfs_super.h | 1 - 7 files changed, 14 insertions(+), 13 deletions(-) diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c index e2148f2d5d6b..5abf653a45d4 100644 --- a/fs/xfs/xfs_bio_io.c +++ b/fs/xfs/xfs_bio_io.c @@ -59,3 +59,11 @@ xfs_rw_bdev( invalidate_kernel_vmap_range(data, count); return error; } + +void +xfs_flush_bdev( + struct block_device *bdev) +{ + blkdev_issue_flush(bdev, GFP_NOFS); +} + diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index f6e5235df7c9..b1d6c530c693 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -1958,7 +1958,7 @@ xfs_free_buftarg( percpu_counter_destroy(&btp->bt_io_count); list_lru_destroy(&btp->bt_lru); - xfs_blkdev_issue_flush(btp); + xfs_flush_bdev(btp->bt_bdev); kmem_free(btp); } diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 38528e59030e..dd33ef2d0e20 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -196,9 +196,9 @@ xfs_file_fsync( * inode size in case of an extending write. */ if (XFS_IS_REALTIME_INODE(ip)) - xfs_blkdev_issue_flush(mp->m_rtdev_targp); + xfs_flush_bdev(mp->m_rtdev_targp->bt_bdev); else if (mp->m_logdev_targp != mp->m_ddev_targp) - xfs_blkdev_issue_flush(mp->m_ddev_targp); + xfs_flush_bdev(mp->m_ddev_targp->bt_bdev); /* * Any inode that has dirty modifications in the log is pinned. The @@ -218,7 +218,7 @@ xfs_file_fsync( */ if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) && mp->m_logdev_targp == mp->m_ddev_targp) - xfs_blkdev_issue_flush(mp->m_ddev_targp); + xfs_flush_bdev(mp->m_ddev_targp->bt_bdev); return error; } diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h index af6be9b9ccdf..e94a2aeefee8 100644 --- a/fs/xfs/xfs_linux.h +++ b/fs/xfs/xfs_linux.h @@ -196,6 +196,7 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y) int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count, char *data, unsigned int op); +void xfs_flush_bdev(struct block_device *bdev); #define ASSERT_ALWAYS(expr) \ (likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__)) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index ff26fb46d70f..493454c98c6f 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -2015,7 +2015,7 @@ xlog_sync( * layer state machine for preflushes. 
*/ if (log->l_targ != log->l_mp->m_ddev_targp || split) { - xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp); + xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev); need_flush = false; }
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 21b1d034aca3..85dd9593b40b 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -339,13 +339,6 @@ xfs_blkdev_put( blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL); } -void -xfs_blkdev_issue_flush( - xfs_buftarg_t *buftarg) -{ - blkdev_issue_flush(buftarg->bt_bdev, GFP_NOFS); -} - STATIC void xfs_close_devices( struct xfs_mount *mp)
diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h index 1ca484b8357f..79cb2dece811 100644 --- a/fs/xfs/xfs_super.h +++ b/fs/xfs/xfs_super.h @@ -88,7 +88,6 @@ struct block_device; extern void xfs_quiesce_attr(struct xfs_mount *mp); extern void xfs_flush_inodes(struct xfs_mount *mp); -extern void xfs_blkdev_issue_flush(struct xfs_buftarg *); extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *, xfs_agnumber_t agcount);

From patchwork Tue Feb 23 03:34:38 2021
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 4/8] xfs: async blkdev cache flush
Date: Tue, 23 Feb 2021 14:34:38 +1100
Message-Id: <20210223033442.3267258-5-david@fromorbit.com>
In-Reply-To: <20210223033442.3267258-1-david@fromorbit.com>
References: <20210223033442.3267258-1-david@fromorbit.com>

From: Dave Chinner

The new checkpoint cache flush mechanism requires us to issue
an unconditional cache flush before we start a new checkpoint. We don't want to block for this if we can help it, and we have a fair chunk of CPU work to do between starting the checkpoint and issuing the first journal IO.

Hence it makes sense to amortise the latency cost of the cache flush by issuing it asynchronously and then waiting for it only when we need to issue the first IO in the transaction.

To do this, we need async cache flush primitives to submit the cache flush bio and to wait on it. The block layer has no such primitives for filesystems, so roll our own for the moment.

Signed-off-by: Dave Chinner
Reviewed-by: Chaitanya Kulkarni
Reviewed-by: Chandan Babu R
---
 fs/xfs/xfs_bio_io.c | 30 ++++++++++++++++++++++++++++++ fs/xfs/xfs_linux.h | 1 + 2 files changed, 31 insertions(+)
diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c index 5abf653a45d4..d55420bc72b5 100644 --- a/fs/xfs/xfs_bio_io.c +++ b/fs/xfs/xfs_bio_io.c @@ -67,3 +67,33 @@ xfs_flush_bdev( blkdev_issue_flush(bdev, GFP_NOFS); } +void +xfs_flush_bdev_async_endio( + struct bio *bio) +{ + if (bio->bi_private) + complete(bio->bi_private); + bio_put(bio); +} + +/* + * Submit a request for an async cache flush to run. If the caller needs to wait + * for the flush completion at a later point in time, they must supply a + * valid completion. This will be signalled when the flush completes. + * The caller never sees the bio that is issued here. + */ +void +xfs_flush_bdev_async( + struct block_device *bdev, + struct completion *done) +{ + struct bio *bio; + + bio = bio_alloc(GFP_NOFS, 0); + bio_set_dev(bio, bdev); + bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC; + bio->bi_private = done; + bio->bi_end_io = xfs_flush_bdev_async_endio; + + submit_bio(bio); +}
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h index e94a2aeefee8..293ff2355e80 100644 --- a/fs/xfs/xfs_linux.h +++ b/fs/xfs/xfs_linux.h @@ -197,6 +197,7 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y) int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count, char *data, unsigned int op); void xfs_flush_bdev(struct block_device *bdev); +void xfs_flush_bdev_async(struct block_device *bdev, struct completion *done); #define ASSERT_ALWAYS(expr) \ (likely(expr) ?
(void)0 : assfail(NULL, #expr, __FILE__, __LINE__))

From patchwork Tue Feb 23 03:34:39 2021
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally
Date: Tue, 23 Feb 2021 14:34:39 +1100
Message-Id: <20210223033442.3267258-6-david@fromorbit.com>
In-Reply-To: <20210223033442.3267258-1-david@fromorbit.com>
References: <20210223033442.3267258-1-david@fromorbit.com>

From: Dave Chinner

Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to guarantee the ordering requirements the journal has w.r.t. metadata writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee that the dirty metadata has been written back in place and is stable.

2. we cannot write back dirty metadata until it has been written to the journal and guaranteed to be stable (and hence recoverable) in the journal.

These rules apply to the atomic transactions recorded in the journal, not to the journal IO itself. Hence we need to ensure metadata is stable before we start writing a new transaction to the journal (guarantee #1), and we need to ensure the entire transaction is stable in the journal before we start metadata writeback (guarantee #2).

The ordering guarantees of #1 are currently provided by REQ_PREFLUSH being added to every iclog IO.
This causes the journal IO to issue a cache flush and wait for it to complete before issuing the write IO to the journal. Hence all completed metadata IO is guaranteed to be stable before the journal overwrites the old metadata. However, for long running CIL checkpoints that might do a thousand journal IOs, we don't need every single one of these iclog IOs to issue a cache flush - the cache flush done before the first iclog is submitted is sufficient to cover the entire range in the log that the checkpoint will overwrite because the CIL space reservation guarantees the tail of the log (completed metadata) is already beyond the range of the checkpoint write. Hence we only need a full cache flush between closing off the CIL checkpoint context (i.e. when the push switches it out) and issuing the first journal IO. Rather than plumbing this through to the journal IO, we can start this cache flush the moment the CIL context is owned exclusively by the push worker. The cache flush can be in progress while we process the CIL ready for writing, hence reducing the latency of the initial iclog write. This is especially true for large checkpoints, where we might have to process hundreds of thousands of log vectors before we issue the first iclog write. In these cases, it is likely the cache flush has already been completed by the time we have built the CIL log vector chain. Signed-off-by: Dave Chinner Reviewed-by: Chandan Babu R Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_log_cil.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index c5cc1b7ad25e..8bcacd463f06 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -656,6 +656,7 @@ xlog_cil_push_work( struct xfs_log_vec lvhdr = { NULL }; xfs_lsn_t commit_lsn; xfs_lsn_t push_seq; + DECLARE_COMPLETION_ONSTACK(bdev_flush); new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS); new_ctx->ticket = xlog_cil_ticket_alloc(log); @@ -719,10 +720,24 @@ xlog_cil_push_work( spin_unlock(&cil->xc_push_lock); /* - * pull all the log vectors off the items in the CIL, and - * remove the items from the CIL. We don't need the CIL lock - * here because it's only needed on the transaction commit - * side which is currently locked out by the flush lock. + * The CIL is stable at this point - nothing new will be added to it + * because we hold the flush lock exclusively. Hence we can now issue + * a cache flush to ensure all the completed metadata in the journal we + * are about to overwrite is on stable storage. + * + * This avoids the need to have the iclogs issue REQ_PREFLUSH based + * cache flushes to provide this ordering guarantee, and hence for CIL + * checkpoints that require hundreds or thousands of log writes no + * longer need to issue device cache flushes to provide metadata + * writeback ordering. + */ + xfs_flush_bdev_async(log->l_mp->m_ddev_targp->bt_bdev, &bdev_flush); + + /* + * Pull all the log vectors off the items in the CIL, and remove the + * items from the CIL. We don't need the CIL lock here because it's only + * needed on the transaction commit side which is currently locked out + * by the flush lock. */ lv = NULL; num_iovecs = 0; @@ -806,6 +821,12 @@ xlog_cil_push_work( lvhdr.lv_iovecp = &lhdr; lvhdr.lv_next = ctx->lv_chain; + /* + * Before we format and submit the first iclog, we have to ensure that + * the metadata writeback ordering cache flush is complete. 
+ */ + wait_for_completion(&bdev_flush); + error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true); if (error) goto out_abort_free_ticket;

From patchwork Tue Feb 23 03:34:40 2021
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write()
Date: Tue, 23 Feb 2021 14:34:40 +1100
Message-Id: <20210223033442.3267258-7-david@fromorbit.com>
In-Reply-To: <20210223033442.3267258-1-david@fromorbit.com>
References: <20210223033442.3267258-1-david@fromorbit.com>

From: Dave Chinner

The CIL push is the only call to xlog_write that sets this variable to true. The other callers don't need a start rec, and they tell xlog_write what to do by passing the type of ophdr they need written in the flags field. The need_start_rec parameter essentially tells xlog_write to write an extra ophdr with an XLOG_START_TRANS type, so get rid of the variable to do this and pass XLOG_START_TRANS as the flag value into xlog_write() from the CIL push.

$ size fs/xfs/xfs_log.o*
   text    data     bss     dec     hex filename
  27595     560       8   28163    6e03 fs/xfs/xfs_log.o.orig
  27454     560       8   28022    6d76 fs/xfs/xfs_log.o.patched

Signed-off-by: Dave Chinner
Reviewed-by: Chandan Babu R
Reviewed-by: Darrick J.
Wong --- fs/xfs/xfs_log.c | 44 +++++++++++++++++++++---------------------- fs/xfs/xfs_log_cil.c | 3 ++- fs/xfs/xfs_log_priv.h | 3 +-- 3 files changed, 25 insertions(+), 25 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 493454c98c6f..6c3fb6dcb505 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -871,9 +871,7 @@ xlog_wait_on_iclog_lsn( static int xlog_write_unmount_record( struct xlog *log, - struct xlog_ticket *ticket, - xfs_lsn_t *lsn, - uint flags) + struct xlog_ticket *ticket) { struct xfs_unmount_log_format ulf = { .magic = XLOG_UNMOUNT_TYPE, @@ -890,7 +888,7 @@ xlog_write_unmount_record( /* account for space used by record data */ ticket->t_curr_res -= sizeof(ulf); - return xlog_write(log, &vec, ticket, lsn, NULL, flags, false); + return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS); } /* @@ -904,15 +902,13 @@ xlog_unmount_write( struct xfs_mount *mp = log->l_mp; struct xlog_in_core *iclog; struct xlog_ticket *tic = NULL; - xfs_lsn_t lsn; - uint flags = XLOG_UNMOUNT_TRANS; int error; error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0); if (error) goto out_err; - error = xlog_write_unmount_record(log, tic, &lsn, flags); + error = xlog_write_unmount_record(log, tic); /* * At this point, we're umounting anyway, so there's no point in * transitioning log state to IOERROR. Just continue... @@ -1604,8 +1600,7 @@ xlog_commit_record( if (XLOG_FORCED_SHUTDOWN(log)) return -EIO; - error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS, - false); + error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS); if (error) xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); return error; @@ -2202,13 +2197,16 @@ static int xlog_write_calc_vec_length( struct xlog_ticket *ticket, struct xfs_log_vec *log_vector, - bool need_start_rec) + uint optype) { struct xfs_log_vec *lv; - int headers = need_start_rec ? 1 : 0; + int headers = 0; int len = 0; int i; + if (optype & XLOG_START_TRANS) + headers++; + for (lv = log_vector; lv; lv = lv->lv_next) { /* we don't write ordered log vectors */ if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) @@ -2428,8 +2426,7 @@ xlog_write( struct xlog_ticket *ticket, xfs_lsn_t *start_lsn, struct xlog_in_core **commit_iclog, - uint flags, - bool need_start_rec) + uint optype) { struct xlog_in_core *iclog = NULL; struct xfs_log_vec *lv = log_vector; @@ -2457,8 +2454,9 @@ xlog_write( xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); } - len = xlog_write_calc_vec_length(ticket, log_vector, need_start_rec); - *start_lsn = 0; + len = xlog_write_calc_vec_length(ticket, log_vector, optype); + if (start_lsn) + *start_lsn = 0; while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) { void *ptr; int log_offset; @@ -2472,7 +2470,7 @@ xlog_write( ptr = iclog->ic_datap + log_offset; /* start_lsn is the first lsn written to. That's all we need. */ - if (!*start_lsn) + if (start_lsn && !*start_lsn) *start_lsn = be64_to_cpu(iclog->ic_header.h_lsn); /* @@ -2485,6 +2483,7 @@ xlog_write( int copy_len; int copy_off; bool ordered = false; + bool wrote_start_rec = false; /* ordered log vectors have no regions to write */ if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) { @@ -2502,13 +2501,15 @@ xlog_write( * write a start record. Only do this for the first * iclog we write to. 
*/ - if (need_start_rec) { + if (optype & XLOG_START_TRANS) { xlog_write_start_rec(ptr, ticket); xlog_write_adv_cnt(&ptr, &len, &log_offset, sizeof(struct xlog_op_header)); + optype &= ~XLOG_START_TRANS; + wrote_start_rec = true; } - ophdr = xlog_write_setup_ophdr(log, ptr, ticket, flags); + ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype); if (!ophdr) return -EIO; @@ -2539,14 +2540,13 @@ xlog_write( } copy_len += sizeof(struct xlog_op_header); record_cnt++; - if (need_start_rec) { + if (wrote_start_rec) { copy_len += sizeof(struct xlog_op_header); record_cnt++; - need_start_rec = false; } data_cnt += contwr ? copy_len : 0; - error = xlog_write_copy_finish(log, iclog, flags, + error = xlog_write_copy_finish(log, iclog, optype, &record_cnt, &data_cnt, &partial_copy, &partial_copy_len, @@ -2590,7 +2590,7 @@ xlog_write( spin_lock(&log->l_icloglock); xlog_state_finish_copy(log, iclog, record_cnt, data_cnt); if (commit_iclog) { - ASSERT(flags & XLOG_COMMIT_TRANS); + ASSERT(optype & XLOG_COMMIT_TRANS); *commit_iclog = iclog; } else { error = xlog_state_release_iclog(log, iclog);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 8bcacd463f06..4093d2d0db7c 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -827,7 +827,8 @@ xlog_cil_push_work( */ wait_for_completion(&bdev_flush); - error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true); + error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, + XLOG_START_TRANS); if (error) goto out_abort_free_ticket;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index a7ac85aaff4e..10a41b1dd895 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -480,8 +480,7 @@ void xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket); void xlog_print_trans(struct xfs_trans *); int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector, struct xlog_ticket *tic, xfs_lsn_t *start_lsn, - struct xlog_in_core **commit_iclog, uint flags, - bool need_start_rec); + struct xlog_in_core **commit_iclog, uint optype); int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, struct xlog_in_core **iclog, xfs_lsn_t *lsn); void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);

From patchwork Tue Feb 23 03:34:41 2021
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 7/8] xfs: journal IO cache flush reductions
Date: Tue, 23 Feb 2021 14:34:41 +1100
Message-Id: <20210223033442.3267258-8-david@fromorbit.com>
In-Reply-To: <20210223033442.3267258-1-david@fromorbit.com>
References: <20210223033442.3267258-1-david@fromorbit.com>

From: Steve Lord

Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to guarantee the ordering requirements the journal has w.r.t. metadata writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee that the dirty metadata has been written back in place and is stable.

2. we cannot write back dirty metadata until it has been written to the journal and guaranteed to be stable (and hence recoverable) in the journal.

The ordering guarantees of #1 are provided by REQ_PREFLUSH. This causes the journal IO to issue a cache flush and wait for it to complete before issuing the write IO to the journal. Hence all completed metadata IO is guaranteed to be stable before the journal overwrites the old metadata.

The ordering guarantees of #2 are provided by REQ_FUA, which ensures the journal writes do not complete until they are on stable storage. Hence by the time the last journal IO in a checkpoint completes, we know that the entire checkpoint is on stable storage and we can unpin the dirty metadata and allow it to be written back.

This is the mechanism by which ordering was first implemented in XFS way back in 2002 by this commit:

  commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
  Author: Steve Lord
  Date:   Fri May 24 14:30:21 2002 +0000

      Add support for drive write cache flushing - should the kernel
      have the infrastructure

A lot has changed since then, most notably we now use delayed logging to checkpoint the filesystem to the journal rather than write each individual transaction to the journal. Cache flushes on journal IO are necessary when individual transactions are wholly contained within a single iclog. However, CIL checkpoints are single transactions that typically span hundreds to thousands of individual journal writes, and so the requirements for device cache flushing have changed.

That is, the ordering rules I state above apply to ordering of atomic transactions recorded in the journal, not to the journal IO itself. Hence we need to ensure metadata is stable before we start writing a new transaction to the journal (guarantee #1), and we need to ensure the entire transaction is stable in the journal before we start metadata writeback (guarantee #2).
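In block layer terms, the two guarantees map directly onto the two bio flags. A minimal sketch of the submission-time translation this patch arrives at (the XLOG_ICL_* flags are the ones introduced in the diff below):

	/*
	 * Guarantee #1: REQ_PREFLUSH flushes the device cache before this
	 * write, so previously completed metadata IO is stable before the
	 * journal overwrites it.
	 *
	 * Guarantee #2: REQ_FUA forces this write itself onto stable media
	 * before the bio completes.
	 */
	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
		iclog->ic_bio.bi_opf |= REQ_FUA;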
Hence we only need a REQ_PREFLUSH on the journal IO that starts a new journal transaction to provide #1, and it is not needed on any other journal IO done within the context of that journal transaction. The CIL checkpoint already issues a cache flush before it starts writing to the log, so we no longer need the iclog IO to issue a REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed to xlog_write(), we no longer need to mark the first iclog in the log write with REQ_PREFLUSH for this case.

Given the new ordering semantics of commit records for the CIL, we need iclogs containing commit records to issue a REQ_PREFLUSH. We also require unmount records to do this. Hence for both XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need to mark the first iclog being written with REQ_PREFLUSH.

For both commit records and unmount records, we also want them immediately on stable storage, so we also want to mark the iclogs that contain these records with REQ_FUA. That means if a record is split across multiple iclogs, they are all marked REQ_FUA and not just the last one, so that when the transaction is completed all the parts of the record are on stable storage.

As an optimisation, when the commit record lands in the same iclog as the journal transaction starts, we don't need to wait for anything and can simply use REQ_FUA to provide guarantee #2. This means that for fsync() heavy workloads, the cache flush behaviour is completely unchanged and there is no degradation in performance as a result of optimising the multi-IO transaction case.

The most notable sign that there is less IO latency on my test machine (nvme SSDs) is that the "noiclogs" rate has dropped substantially. This metric indicates that the CIL push is blocking in xlog_get_iclog_space() waiting for iclog IO completion to occur. With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space() is blocking waiting for log IO. With the changes in this patch, this drops to 1 noiclog event for every 100 iclog writes. Hence it is clear that log IO is completing much faster than it was previously, but it is also clear that for large iclog sizes, this isn't the performance limiting factor on this hardware.

With smaller iclogs (32kB), however, there is a substantial difference. With the cache flush modifications, the journal is now running at over 4000 write IOPS, and the journal throughput is largely identical to the 256kB iclogs and the noiclog event rate stays low at about 1:50 iclog writes. The existing code tops out at about 2500 IOPS as the number of cache flushes dominates performance and latency. The noiclog event rate is about 1:4, and the performance variance is quite large as the journal throughput can fall to less than half the peak sustained rate when the cache flush rate prevents metadata writeback from keeping up and the log runs out of space and throttles reservations.

As a result:

	 logbsize	fsmark create rate	rm -rf
	before	 32kb	152851+/-5.3e+04	5m28s
	patched	 32kb	221533+/-1.1e+04	5m24s
	before	256kb	220239+/-6.2e+03	4m58s
	patched	256kb	228286+/-9.2e+03	5m06s

The rm -rf times are included because I ran them, but the differences are largely noise. This workload is largely metadata read IO latency bound and the changes to the journal cache flushing doesn't really make any noticeable difference to behaviour apart from a reduction in noiclog events from background CIL pushing.
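Condensing the resulting flag policy into one place (a sketch assembled from the hunks below, not a literal function in the patch):

	/* xlog_write(): commit and unmount records must order against all
	 * prior journal IO and be durable when their IO completes. */
	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;

	/* xlog_cil_push_work(): if the whole checkpoint fits inside one
	 * iclog there are no earlier iclogs to order against, so the
	 * commit record's cache flush can be dropped. */
	if (ctx->start_lsn == commit_lsn)
		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;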
Signed-off-by: Dave Chinner --- fs/xfs/xfs_log.c | 33 +++++++++++++++++++++++---------- fs/xfs/xfs_log_cil.c | 7 ++++++- fs/xfs/xfs_log_priv.h | 4 ++++ 3 files changed, 33 insertions(+), 11 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 6c3fb6dcb505..08d68a6161ae 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1806,8 +1806,7 @@ xlog_write_iclog( struct xlog *log, struct xlog_in_core *iclog, uint64_t bno, - unsigned int count, - bool need_flush) + unsigned int count) { ASSERT(bno < log->l_logBBsize); @@ -1845,10 +1844,12 @@ xlog_write_iclog( * writeback throttle from throttling log writes behind background * metadata writeback and causing priority inversions. */ - iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | - REQ_IDLE | REQ_FUA; - if (need_flush) + iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE; + if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH) iclog->ic_bio.bi_opf |= REQ_PREFLUSH; + if (iclog->ic_flags & XLOG_ICL_NEED_FUA) + iclog->ic_bio.bi_opf |= REQ_FUA; + iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA); if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) { xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); @@ -1951,7 +1952,7 @@ xlog_sync( unsigned int roundoff; /* roundoff to BB or stripe */ uint64_t bno; unsigned int size; - bool need_flush = true, split = false; + bool split = false; ASSERT(atomic_read(&iclog->ic_refcnt) == 0); @@ -2009,13 +2010,14 @@ xlog_sync( * synchronously here; for an internal log we can simply use the block * layer state machine for preflushes. */ - if (log->l_targ != log->l_mp->m_ddev_targp || split) { + if (log->l_targ != log->l_mp->m_ddev_targp || + (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) { xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev); - need_flush = false; + iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH; } xlog_verify_iclog(log, iclog, count); - xlog_write_iclog(log, iclog, bno, count, need_flush); + xlog_write_iclog(log, iclog, bno, count); } /* @@ -2469,10 +2471,21 @@ xlog_write( ASSERT(log_offset <= iclog->ic_size - 1); ptr = iclog->ic_datap + log_offset; - /* start_lsn is the first lsn written to. That's all we need. */ + /* Start_lsn is the first lsn written to. */ if (start_lsn && !*start_lsn) *start_lsn = be64_to_cpu(iclog->ic_header.h_lsn); + /* + * iclogs containing commit records or unmount records need + * to issue ordering cache flushes and commit immediately + * to stable storage to guarantee journal vs metadata ordering + * is correctly maintained in the storage media. + */ + if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) { + iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | + XLOG_ICL_NEED_FUA); + } + /* * This loop writes out as many regions as can fit in the amount * of space which was allocated by xlog_state_get_iclog_space(). diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 4093d2d0db7c..370da7c2bfc8 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -894,10 +894,15 @@ xlog_cil_push_work( /* * If the checkpoint spans multiple iclogs, wait for all previous - * iclogs to complete before we submit the commit_iclog. + * iclogs to complete before we submit the commit_iclog. If it is in the + * same iclog as the start of the checkpoint, then we can skip the iclog + * cache flush because there are no other iclogs we need to order + * against. */ if (ctx->start_lsn != commit_lsn) xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn); + else + commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH; /* release the hounds! 
*/ xfs_log_release_iclog(commit_iclog);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 10a41b1dd895..24acdc54e44e 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -133,6 +133,9 @@ enum xlog_iclog_state { #define XLOG_COVER_OPS 5 +#define XLOG_ICL_NEED_FLUSH (1 << 0) /* iclog needs REQ_PREFLUSH */ +#define XLOG_ICL_NEED_FUA (1 << 1) /* iclog needs REQ_FUA */ + /* Ticket reservation region accounting */ #define XLOG_TIC_LEN_MAX 15 @@ -201,6 +204,7 @@ typedef struct xlog_in_core { u32 ic_size; u32 ic_offset; enum xlog_iclog_state ic_state; + unsigned int ic_flags; char *ic_datap; /* pointer to iclog data */ /* Callback structures need their own cacheline */

From patchwork Tue Feb 23 03:34:42 2021
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 8/8] xfs: Fix CIL throttle hang when CIL space used going backwards
Date: Tue, 23 Feb 2021 14:34:42 +1100
Message-Id: <20210223033442.3267258-9-david@fromorbit.com>
In-Reply-To: <20210223033442.3267258-1-david@fromorbit.com>
References: <20210223033442.3267258-1-david@fromorbit.com>

From: Dave Chinner

A hang with tasks stuck on the CIL hard throttle was reported and largely diagnosed by Donald Buczek, who discovered that it was a result of the CIL context space usage decrementing in committed transactions once the hard throttle limit had been hit and processes were already blocked.
This resulted in the CIL push not waking up those waiters because the CIL context was no longer over the hard throttle limit.

The surprising aspect of this was the CIL space usage going backwards regularly enough to trigger this situation. Assumptions had been made in design that the relogging process would only increase the size of the objects in the CIL, and so that space would only increase.

This change and commit message fixes the issue and documents the result of an audit of the triggers that can cause the CIL space to go backwards, how large the backwards steps tend to be, the frequency with which they occur, and what the impact on the CIL accounting code is.

Even though the CIL ctx->space_used can go backwards, it will only do so if the log item is already logged to the CIL and contains a space reservation for its entire logged state. This is tracked by the shadow buffer state on the log item. If the item is not previously logged in the CIL it has no shadow buffer nor log vector, and hence the entire size of the logged item copied to the log vector is accounted to the CIL space usage. i.e. it will always go up in this case.

If the item has a log vector (i.e. already in the CIL) and the size decreases, then the existing log vector will be overwritten and the space usage will go down. This is the only condition where the space usage reduces, and it can only occur when an item is already tracked in the CIL. Hence we are safe from CIL space usage underruns as a result of log items decreasing in size when they are relogged.

Typically this reduction in CIL usage occurs from metadata blocks being freed, such as when a btree block merge occurs or a directory entry/xattr entry is removed and the da-tree is reduced in size. This generally results in a reduction in size of around a single block in the CIL, but also tends to increase the number of log vectors because the parent and sibling nodes in the tree need to be updated when a btree block is removed. If a multi-level merge occurs, then we see reduction in size of 2+ blocks, but again the log vector count goes up.

The other vector is inode fork size changes, which only log the current size of the fork and ignore the previously logged size when the fork is relogged. Hence if we are removing items from the inode fork (dir/xattr removal in shortform, extent record removal in extent form, etc) the relogged size of the inode fork can decrease.

No other log items can decrease in size either because they are a fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging an intent actually creates a new intent log item and doesn't relog the old item at all). Hence the only two vectors for CIL context size reduction are relogging inode forks and marking buffers active in the CIL as stale.

Long story short: the majority of the code does the right thing and handles the reduction in log item size correctly, and only the CIL hard throttle implementation is problematic and needs fixing. This patch makes that fix, as well as adds comments in the log item code that result in items shrinking in size when they are relogged as a clear reminder that this can and does happen frequently.

The throttle fix is based upon the change Donald proposed, though it goes further to ensure that once the throttle is activated, it captures all tasks until the CIL push issues a wakeup, regardless of whether the CIL space used has gone back under the throttle threshold.
This is achieved by checking whether the throttle wait queue is already active as a condition of throttling: once we start throttling, we continue to apply the throttle until the CIL context push wakes everything on the wait queue.

We can use waitqueue_active() for the waitqueue manipulations and checks because they are all done under the ctx->xc_push_lock. Hence the waitqueue has external serialisation and we can safely peek inside the wait queue without holding the internal waitqueue locks.

Many thanks to Donald for his diagnosis and analysis work to isolate the cause of this hang.

Reported-by: Donald Buczek Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Chandan Babu R Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_buf_item.c | 37 ++++++++++++++++++------------------- fs/xfs/xfs_inode_item.c | 14 ++++++++++++++ fs/xfs/xfs_log_cil.c | 22 +++++++++++++++++----- 3 files changed, 49 insertions(+), 24 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index dc0be2a639cc..17960b1ce5ef 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -56,14 +56,12 @@ xfs_buf_log_format_size( } /* - * This returns the number of log iovecs needed to log the - * given buf log item. + * Return the number of log iovecs and space needed to log the given buf log + * item segment. * - * It calculates this as 1 iovec for the buf log format structure - * and 1 for each stretch of non-contiguous chunks to be logged. - * Contiguous chunks are logged in a single iovec. - * - * If the XFS_BLI_STALE flag has been set, then log nothing. + * It calculates this as 1 iovec for the buf log format structure and 1 for each + * stretch of non-contiguous chunks to be logged. Contiguous chunks are logged + * in a single iovec. */ STATIC void xfs_buf_item_size_segment( @@ -119,11 +117,8 @@ xfs_buf_item_size_segment( } /* - * This returns the number of log iovecs needed to log the given buf log item. - * - * It calculates this as 1 iovec for the buf log format structure and 1 for each - * stretch of non-contiguous chunks to be logged. Contiguous chunks are logged - * in a single iovec. + * Return the number of log iovecs and space needed to log the given buf log + * item. * * Discontiguous buffers need a format structure per region that is being * logged. This makes the changes in the buffer appear to log recovery as though @@ -133,7 +128,11 @@ xfs_buf_item_size_segment( * what ends up on disk. * * If the XFS_BLI_STALE flag has been set, then log nothing but the buf log - * format structures. + * format structures. If the item has previously been logged and has dirty + * regions, we do not relog them in stale buffers. This has the effect of + * reducing the size of the relogged item by the amount of dirty data tracked + * by the log item. This can result in the committing transaction reducing the + * amount of space being consumed by the CIL. */ STATIC void xfs_buf_item_size( @@ -147,9 +146,9 @@ xfs_buf_item_size( ASSERT(atomic_read(&bip->bli_refcount) > 0); if (bip->bli_flags & XFS_BLI_STALE) { /* - * The buffer is stale, so all we need to log - * is the buf log format structure with the - * cancel flag in it. + * The buffer is stale, so all we need to log is the buf log + * format structure with the cancel flag in it as we are never + * going to replay the changes tracked in the log item.
*/ trace_xfs_buf_item_size_stale(bip); ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL); @@ -164,9 +163,9 @@ xfs_buf_item_size( if (bip->bli_flags & XFS_BLI_ORDERED) { /* - * The buffer has been logged just to order it. - * It is not being included in the transaction - * commit, so no vectors are used at all. + * The buffer has been logged just to order it. It is not being + * included in the transaction commit, so no vectors are used at + * all. */ trace_xfs_buf_item_size_ordered(bip); *nvecs = XFS_LOG_VEC_ORDERED; diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 17e20a6d8b4e..6ff91e5bf3cd 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -28,6 +28,20 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip) return container_of(lip, struct xfs_inode_log_item, ili_item); } +/* + * The logged size of an inode fork is always the current size of the inode + * fork. This means that when an inode fork is relogged, the size of the logged + * region is determined by the current state, not the combination of the + * previously logged state + the current state. This is different relogging + * behaviour from most other log items, which retain the size of the + * previously logged changes when smaller regions are relogged. + * + * Hence for operations that remove data from the inode fork (e.g. shortform + * dir/attr remove, extent form extent removal, etc), the size of the relogged + * inode gets -smaller- rather than staying the same as the previously logged + * size, and this can result in the committing transaction reducing the amount + * of space being consumed by the CIL. + */ STATIC void xfs_inode_item_data_fork_size( struct xfs_inode_log_item *iip, diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 370da7c2bfc8..0a00c3c9610c 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -669,9 +669,14 @@ xlog_cil_push_work( ASSERT(push_seq <= ctx->sequence); /* - * Wake up any background push waiters now this context is being pushed. + * As we are about to switch to a new, empty CIL context, we no longer + * need to throttle tasks on CIL space overruns. Wake any waiters that + * the hard push throttle may have caught so they can start committing + * to the new context. The ctx->xc_push_lock provides the serialisation + * necessary for safely using the lockless waitqueue_active() check in + * this context. */ - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) + if (waitqueue_active(&cil->xc_push_wait)) wake_up_all(&cil->xc_push_wait); /* @@ -941,7 +946,7 @@ xlog_cil_push_background( ASSERT(!list_empty(&cil->xc_cil)); /* - * don't do a background push if we haven't used up all the + * Don't do a background push if we haven't used up all the * space available yet. */ if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) { @@ -965,9 +970,16 @@ xlog_cil_push_background( /* * If we are well over the space limit, throttle the work that is being - * done until the push work on this context has begun. + * done until the push work on this context has begun. Enforce the hard + * throttle on all transaction commits once it has been activated, even + * if the committing transactions have resulted in the space usage + * dipping back down under the hard limit. + * + * The ctx->xc_push_lock provides the serialisation necessary for safely + * using the lockless waitqueue_active() check in this context.
*/ - if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { + if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) || + waitqueue_active(&cil->xc_push_wait)) { trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket); ASSERT(cil->xc_ctx->space_used < log->l_logsize); xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
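To put invented numbers on the "space used going backwards" behaviour the commit message describes, here is a tiny standalone illustration; the variables model ctx->space_used and the size held in an item's existing log vector, and the sizes are made up:

#include <stdio.h>

int main(void)
{
	int space_used = 0;	/* models ctx->space_used */
	int logged = 0;		/* size already held in the item's log vector */
	int new_size;

	/* First commit of the item: no log vector exists yet, so the whole
	 * logged size is accounted and usage can only go up. */
	new_size = 480;
	space_used += new_size - logged;	/* +480 */
	logged = new_size;

	/* Relog after e.g. a btree block merge: the item shrank, the existing
	 * log vector is overwritten, and the accounting goes backwards. */
	new_size = 352;
	space_used += new_size - logged;	/* -128 */
	logged = new_size;

	printf("space_used = %d\n", space_used);	/* 352, not 480 */
	return 0;
}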