[06/45] xfs: CIL checkpoint flushes caches unconditionally


Message ID

20210305051143.182133-7-david@fromorbit.com (mailing list archive)

State

Superseded

Headers

From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 06/45] xfs: CIL checkpoint flushes caches unconditionally
Date: Fri,  5 Mar 2021 16:11:04 +1100
Message-Id: <20210305051143.182133-7-david@fromorbit.com>
In-Reply-To: <20210305051143.182133-1-david@fromorbit.com>
References: <20210305051143.182133-1-david@fromorbit.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

xfs: consolidated log and optimisation changes

Commit Message

Dave Chinner March 5, 2021, 5:11 a.m. UTC

From: Dave Chinner <dchinner@redhat.com>

Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
guarantee the ordering requirements the journal has w.r.t. metadata
writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee
that the dirty metadata has been written back in place and is
stable.

2. we cannot write back dirty metadata until it has been written to
the journal and guaranteed to be stable (and hence recoverable) in
the journal.

These rules apply to the atomic transactions recorded in the
journal, not to the journal IO itself. Hence we need to ensure
metadata is stable before we start writing a new transaction to the
journal (guarantee #1), and we need to ensure the entire transaction
is stable in the journal before we start metadata writeback
(guarantee #2).

The ordering guarantees of #1 are currently provided by REQ_PREFLUSH
being added to every iclog IO. This causes the journal IO to issue a
cache flush and wait for it to complete before issuing the write IO
to the journal. Hence all completed metadata IO is guaranteed to be
stable before the journal overwrites the old metadata.

However, for long running CIL checkpoints that might do a thousand
journal IOs, we don't need every single one of these iclog IOs to
issue a cache flush - the cache flush done before the first iclog is
submitted is sufficient to cover the entire range in the log that
the checkpoint will overwrite because the CIL space reservation
guarantees the tail of the log (completed metadata) is already
beyond the range of the checkpoint write.

Hence we only need a full cache flush between closing off the CIL
checkpoint context (i.e. when the push switches it out) and issuing
the first journal IO. Rather than plumbing this through to the
journal IO, we can start this cache flush the moment the CIL context
is owned exclusively by the push worker. The cache flush can be in
progress while we process the CIL ready for writing, hence
reducing the latency of the initial iclog write. This is especially
true for large checkpoints, where we might have to process hundreds
of thousands of log vectors before we issue the first iclog write.
In these cases, it is likely the cache flush has already been
completed by the time we have built the CIL log vector chain.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log_cil.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)
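
The latency argument in the commit message can be sketched with a toy
timing model. This is a user-space illustration only: the struct, the
function names and the cost units are all hypothetical and do not exist
in the kernel. It assumes the flush and the chain building proceed fully
in parallel once the flush has been issued.

```c
#include <assert.h>

/*
 * Hypothetical cost model in arbitrary time units. None of these names
 * exist in the kernel; they only illustrate the timing argument.
 */
struct push_model {
	long flush_cost;	/* time for the device to complete a cache flush */
	long per_vector_cost;	/* CPU time to process one log vector */
	long nr_vectors;	/* log vectors in the checkpoint */
};

/* CPU time spent building the log vector chain. */
long chain_build_time(const struct push_model *m)
{
	assert(m->per_vector_cost >= 0 && m->nr_vectors >= 0);
	return m->per_vector_cost * m->nr_vectors;
}

/* Flush issued only once the chain is built: the two costs add up. */
long first_write_serial(const struct push_model *m)
{
	return chain_build_time(m) + m->flush_cost;
}

/* Flush issued before chain building: the two costs overlap. */
long first_write_overlapped(const struct push_model *m)
{
	long build = chain_build_time(m);

	return build > m->flush_cost ? build : m->flush_cost;
}

/* Time the push worker would block waiting for the flush to finish. */
long flush_wait_stall(const struct push_model *m)
{
	long build = chain_build_time(m);

	return m->flush_cost > build ? m->flush_cost - build : 0;
}
```

With flush_cost = 1000, per_vector_cost = 1 and nr_vectors = 200000, the
model gives a stall of 0 at the wait, which is the large-checkpoint case
described above. With only 100 vectors the stall is 900, so small
checkpoints still wait for most of the flush.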

Comments

Brian Foster March 15, 2021, 2:43 p.m. UTC | #1

On Fri, Mar 05, 2021 at 04:11:04PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_log_cil.c | 31 +++++++++++++++++++++++++++----
>  1 file changed, 27 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 1e5fd6f268c2..b4cdb8b6c4c3 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
...
> @@ -719,10 +721,25 @@ xlog_cil_push_work(
>  	spin_unlock(&cil->xc_push_lock);
>  
>  	/*
> -	 * pull all the log vectors off the items in the CIL, and
> -	 * remove the items from the CIL. We don't need the CIL lock
> -	 * here because it's only needed on the transaction commit
> -	 * side which is currently locked out by the flush lock.
> +	 * The CIL is stable at this point - nothing new will be added to it
> +	 * because we hold the flush lock exclusively. Hence we can now issue
> +	 * a cache flush to ensure all the completed metadata in the journal we
> +	 * are about to overwrite is on stable storage.
> +	 *
> +	 * This avoids the need to have the iclogs issue REQ_PREFLUSH based
> +	 * cache flushes to provide this ordering guarantee, and hence for CIL
> +	 * checkpoints that require hundreds or thousands of log writes no
> +	 * longer need to issue device cache flushes to provide metadata
> +	 * writeback ordering.
> +	 */

I don't think we need to have code comments to explain why some other
code doesn't do something or doesn't exist. This seems like something
that should stick to the commit log description (between this patch and
the future patch that removes the historical behavior). IOW, I'd just
drop that second paragraph.

Otherwise (and modulo my previous thoughts on the bio) LGTM:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
> +				&bdev_flush);
> +
> +	/*
> +	 * Pull all the log vectors off the items in the CIL, and remove the
> +	 * items from the CIL. We don't need the CIL lock here because it's only
> +	 * needed on the transaction commit side which is currently locked out
> +	 * by the flush lock.
>  	 */
>  	lv = NULL;
>  	num_iovecs = 0;
> @@ -806,6 +823,12 @@ xlog_cil_push_work(
>  	lvhdr.lv_iovecp = &lhdr;
>  	lvhdr.lv_next = ctx->lv_chain;
>  
> +	/*
> +	 * Before we format and submit the first iclog, we have to ensure that
> +	 * the metadata writeback ordering cache flush is complete.
> +	 */
> +	wait_for_completion(&bdev_flush);
> +
>  	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
>  	if (error)
>  		goto out_abort_free_ticket;
> -- 
> 2.28.0
>

Christoph Hellwig March 16, 2021, 8:47 a.m. UTC | #2

> +	/*
> +	 * Before we format and submit the first iclog, we have to ensure that
> +	 * the metadata writeback ordering cache flush is complete.
> +	 */
> +	wait_for_completion(&bdev_flush);

.. and this would be where we'd check bio.bi_status for an error ..
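
A minimal user-space sketch of the kind of check suggested here, not the
actual kernel change. struct flush_req stands in for the on-stack bio
plus completion, and every name below is hypothetical. It assumes the
flush result is reported as 0 or a negative errno, and that a failed
flush should abort the push before any iclog is written.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the on-stack bio and its completion. */
struct flush_req {
	int done;	/* set once the flush has completed */
	int status;	/* 0 on success, negative errno on failure */
};

/* Models the IO completion handler recording the flush result. */
void flush_req_complete(struct flush_req *req, int status)
{
	req->status = status;
	req->done = 1;
}

/*
 * Models the point after wait_for_completion(): the flush must have
 * finished, and its status decides whether the journal may be written.
 */
int flush_req_result(const struct flush_req *req)
{
	assert(req->done);
	return req->status;
}

/*
 * Models the tail of the push: write nr_iclogs journal buffers only if
 * the ordering flush succeeded, otherwise abort without writing any.
 */
int push_checkpoint(const struct flush_req *req, int nr_iclogs,
		    int *iclogs_written)
{
	int error = flush_req_result(req);

	*iclogs_written = 0;
	if (error)
		return error;	/* analogous to the abort path in the patch */

	*iclogs_written = nr_iclogs;
	return 0;
}
```

In this model a flush completed with -EIO makes push_checkpoint() return
-EIO with no iclogs written, while a successful flush lets all of them
through.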

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 1e5fd6f268c2..b4cdb8b6c4c3 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -656,6 +656,8 @@  xlog_cil_push_work(
 	struct xfs_log_vec	lvhdr = { NULL };
 	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
+	struct bio		bio;
+	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -719,10 +721,25 @@  xlog_cil_push_work(
 	spin_unlock(&cil->xc_push_lock);
 
 	/*
-	 * pull all the log vectors off the items in the CIL, and
-	 * remove the items from the CIL. We don't need the CIL lock
-	 * here because it's only needed on the transaction commit
-	 * side which is currently locked out by the flush lock.
+	 * The CIL is stable at this point - nothing new will be added to it
+	 * because we hold the flush lock exclusively. Hence we can now issue
+	 * a cache flush to ensure all the completed metadata in the journal we
+	 * are about to overwrite is on stable storage.
+	 *
+	 * This avoids the need to have the iclogs issue REQ_PREFLUSH based
+	 * cache flushes to provide this ordering guarantee, and hence for CIL
+	 * checkpoints that require hundreds or thousands of log writes no
+	 * longer need to issue device cache flushes to provide metadata
+	 * writeback ordering.
+	 */
+	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
+				&bdev_flush);
+
+	/*
+	 * Pull all the log vectors off the items in the CIL, and remove the
+	 * items from the CIL. We don't need the CIL lock here because it's only
+	 * needed on the transaction commit side which is currently locked out
+	 * by the flush lock.
 	 */
 	lv = NULL;
 	num_iovecs = 0;
@@ -806,6 +823,12 @@  xlog_cil_push_work(
 	lvhdr.lv_iovecp = &lhdr;
 	lvhdr.lv_next = ctx->lv_chain;
 
+	/*
+	 * Before we format and submit the first iclog, we have to ensure that
+	 * the metadata writeback ordering cache flush is complete.
+	 */
+	wait_for_completion(&bdev_flush);
+
 	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
 	if (error)
 		goto out_abort_free_ticket;
