Message ID | 20210519121317.585244-6-david@fromorbit.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | xfs: CIL and log optimisations | expand |
On 5/19/21 5:12 AM, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to > guarantee the ordering requirements the journal has w.r.t. metadata > writeback. THe two ordering constraints are: > > 1. we cannot overwrite metadata in the journal until we guarantee > that the dirty metadata has been written back in place and is > stable. > > 2. we cannot write back dirty metadata until it has been written to > the journal and guaranteed to be stable (and hence recoverable) in > the journal. > > These rules apply to the atomic transactions recorded in the > journal, not to the journal IO itself. Hence we need to ensure > metadata is stable before we start writing a new transaction to the > journal (guarantee #1), and we need to ensure the entire transaction > is stable in the journal before we start metadata writeback > (guarantee #2). > > The ordering guarantees of #1 are currently provided by REQ_PREFLUSH > being added to every iclog IO. This causes the journal IO to issue a > cache flush and wait for it to complete before issuing the write IO > to the journal. Hence all completed metadata IO is guaranteed to be > stable before the journal overwrites the old metadata. > > However, for long running CIL checkpoints that might do a thousand > journal IOs, we don't need every single one of these iclog IOs to > issue a cache flush - the cache flush done before the first iclog is > submitted is sufficient to cover the entire range in the log that > the checkpoint will overwrite because the CIL space reservation > guarantees the tail of the log (completed metadata) is already > beyond the range of the checkpoint write. > > Hence we only need a full cache flush between closing off the CIL > checkpoint context (i.e. when the push switches it out) and issuing > the first journal IO. Rather than plumbing this through to the > journal IO, we can start this cache flush the moment the CIL context > is owned exclusively by the push worker. The cache flush can be in > progress while we process the CIL ready for writing, hence > reducing the latency of the initial iclog write. This is especially > true for large checkpoints, where we might have to process hundreds > of thousands of log vectors before we issue the first iclog write. > In these cases, it is likely the cache flush has already been > completed by the time we have built the CIL log vector chain. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > Reviewed-by: Brian Foster <bfoster@redhat.com> Ok, makes sense Reviewed-by: Allison Henderson <allison.henderson@oracle.com> > --- > fs/xfs/xfs_log_cil.c | 25 +++++++++++++++++++++---- > 1 file changed, 21 insertions(+), 4 deletions(-) > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 1e5fd6f268c2..7b8b7ac85ea9 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -656,6 +656,8 @@ xlog_cil_push_work( > struct xfs_log_vec lvhdr = { NULL }; > xfs_lsn_t commit_lsn; > xfs_lsn_t push_seq; > + struct bio bio; > + DECLARE_COMPLETION_ONSTACK(bdev_flush); > > new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS); > new_ctx->ticket = xlog_cil_ticket_alloc(log); > @@ -719,10 +721,19 @@ xlog_cil_push_work( > spin_unlock(&cil->xc_push_lock); > > /* > - * pull all the log vectors off the items in the CIL, and > - * remove the items from the CIL. We don't need the CIL lock > - * here because it's only needed on the transaction commit > - * side which is currently locked out by the flush lock. > + * The CIL is stable at this point - nothing new will be added to it > + * because we hold the flush lock exclusively. Hence we can now issue > + * a cache flush to ensure all the completed metadata in the journal we > + * are about to overwrite is on stable storage. > + */ > + xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev, > + &bdev_flush); > + > + /* > + * Pull all the log vectors off the items in the CIL, and remove the > + * items from the CIL. We don't need the CIL lock here because it's only > + * needed on the transaction commit side which is currently locked out > + * by the flush lock. > */ > lv = NULL; > num_iovecs = 0; > @@ -806,6 +817,12 @@ xlog_cil_push_work( > lvhdr.lv_iovecp = &lhdr; > lvhdr.lv_next = ctx->lv_chain; > > + /* > + * Before we format and submit the first iclog, we have to ensure that > + * the metadata writeback ordering cache flush is complete. > + */ > + wait_for_completion(&bdev_flush); > + > error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true); > if (error) > goto out_abort_free_ticket; >
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 1e5fd6f268c2..7b8b7ac85ea9 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -656,6 +656,8 @@ xlog_cil_push_work( struct xfs_log_vec lvhdr = { NULL }; xfs_lsn_t commit_lsn; xfs_lsn_t push_seq; + struct bio bio; + DECLARE_COMPLETION_ONSTACK(bdev_flush); new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS); new_ctx->ticket = xlog_cil_ticket_alloc(log); @@ -719,10 +721,19 @@ xlog_cil_push_work( spin_unlock(&cil->xc_push_lock); /* - * pull all the log vectors off the items in the CIL, and - * remove the items from the CIL. We don't need the CIL lock - * here because it's only needed on the transaction commit - * side which is currently locked out by the flush lock. + * The CIL is stable at this point - nothing new will be added to it + * because we hold the flush lock exclusively. Hence we can now issue + * a cache flush to ensure all the completed metadata in the journal we + * are about to overwrite is on stable storage. + */ + xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev, + &bdev_flush); + + /* + * Pull all the log vectors off the items in the CIL, and remove the + * items from the CIL. We don't need the CIL lock here because it's only + * needed on the transaction commit side which is currently locked out + * by the flush lock. */ lv = NULL; num_iovecs = 0; @@ -806,6 +817,12 @@ xlog_cil_push_work( lvhdr.lv_iovecp = &lhdr; lvhdr.lv_next = ctx->lv_chain; + /* + * Before we format and submit the first iclog, we have to ensure that + * the metadata writeback ordering cache flush is complete. + */ + wait_for_completion(&bdev_flush); + error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true); if (error) goto out_abort_free_ticket;