Message ID | 20210128044154.806715-3-david@fromorbit.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | xfs: various log stuff... | expand |
FYI, I finally got to reading your prior log cache flushing thread yesterday afternoon. I was planning to revisit that and probably reply this morning after having some time to digest, but saw this and so will reply here.. On Thu, Jan 28, 2021 at 03:41:51PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > To allow for iclog IO device cache flush behaviour to be optimised, > we first need to separate out the commit record iclog IO from the > rest of the checkpoint so we can wait for the checkpoint IO to > complete before we issue the commit record. > > This separate is only necessary if the commit record is being > written into a different iclog to the start of the checkpoint. If > the entire checkpoint and commit is in the one iclog, then they are > both covered by the one set of cache flush primitives on the iclog > and hence there is no need to separate them. > I find the description here a bit vague.. we have to separate out the commit record I/O, but only if it's already separate..? Glancing at the code, I don't see any new "separation" happening, so it's not clear to me what that really refers to. This looks more like we're introducing serialization to provide checkpoint -> commit record ordering (when the commit record happens to be in a separate iclog). > Otherwise, we need to wait for all the previous iclogs to complete > so they are ordered correctly and made stable by the REQ_PREFLUSH > that the commit record iclog IO issues. This guarantees that if a > reader sees the commit record in the journal, they will also see the > entire checkpoint that commit record closes off. > > This also provides the guarantee that when the commit record IO > completes, we can safely unpin all the log items in the checkpoint > so they can be written back because the entire checkpoint is stable > in the journal. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log.c | 47 +++++++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_log_cil.c | 7 +++++++ > fs/xfs/xfs_log_priv.h | 2 ++ > 3 files changed, 56 insertions(+) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index c5f507c24577..c5e3da23961c 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -808,6 +808,53 @@ xlog_wait_on_iclog( > return 0; > } > > +/* > + * Wait on any iclogs that are still flushing in the range of start_lsn to > + * the current iclog's lsn. The caller holds a reference to the iclog, but > + * otherwise holds no log locks. > + * > + * We walk backwards through the iclogs to find the iclog with the highest lsn > + * in the range that we need to wait for and then wait for it to complete. > + * Completion ordering of iclog IOs ensures that all prior iclogs to this IO are > + * complete by the time our candidate has completed. > + */ > +int > +xlog_wait_on_iclog_lsn( > + struct xlog_in_core *iclog, > + xfs_lsn_t start_lsn) > +{ > + struct xlog *log = iclog->ic_log; > + struct xlog_in_core *prev; > + int error = -EIO; > + > + spin_lock(&log->l_icloglock); > + if (XLOG_FORCED_SHUTDOWN(log)) > + goto out_unlock; > + > + error = 0; > + for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) { > + > + /* Done if the lsn is before our start lsn */ > + if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn), > + start_lsn) < 0) > + break; Hmm.. that logic looks a bit dodgy when you consider that the iclog header lsn is reset to zero on activation. I think it actually works as intended because iclog completion reactivates iclogs in LSN order and this loop walks in reverse order, but that is a very subtle connection that might be useful to document. > + > + /* Don't need to wait on completed, clean iclogs */ > + if (prev->ic_state == XLOG_STATE_DIRTY || > + prev->ic_state == XLOG_STATE_ACTIVE) { > + continue; > + } > + > + /* wait for completion on this iclog */ > + xlog_wait(&prev->ic_force_wait, &log->l_icloglock); > + return 0; > + } > + > +out_unlock: > + spin_unlock(&log->l_icloglock); > + return error; > +} > + > /* > * Write out an unmount record using the ticket provided. We have to account for > * the data space used in the unmount ticket as this write is not done from a > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index b0ef071b3cb5..c5cc1b7ad25e 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -870,6 +870,13 @@ xlog_cil_push_work( > wake_up_all(&cil->xc_commit_wait); > spin_unlock(&cil->xc_push_lock); > > + /* > + * If the checkpoint spans multiple iclogs, wait for all previous > + * iclogs to complete before we submit the commit_iclog. > + */ > + if (ctx->start_lsn != commit_lsn) > + xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn); > + This is an interesting serialization point because we don't necessarily submit the iclog that holds the commit record. I actually think in most cases the separate commit record won't land on disk until the next CIL push causes a sync of the current head iclog (assuming it fills it up). Granted, this is the last point we have context of the commit record being written, so it makes sense from an opportunistic standpoint to serialize here (just as we already block to ensure commit record ordering across checkpoints). That said, with the aggressive batching ability of the CIL I think blocking on prior pushes is potentially less heavy handed than unconditionally blocking on all prior iclog I/O. I wonder if this might be something to address if we're going to compound this path with more serialization. From a performance perspective, it seems like this makes the CIL push _nearly_ synchronous by default. I.e., if we have a several hundred MB CIL context, we're going to wait for all but the final iclog to complete before we return. Of course we'll be waiting for much of that anyways to reuse limited iclog space, but I have to think more about what that means in general (maybe not much) and get through the subsequent patches. In the meantime, have you put thought into any potential negative performance impact from this that might or might not be offset by subsequent flush optimizations? Brian > /* release the hounds! */ > xfs_log_release_iclog(commit_iclog); > return; > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index 037950cf1061..a7ac85aaff4e 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -584,6 +584,8 @@ xlog_wait( > remove_wait_queue(wq, &wait); > } > > +int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn); > + > /* > * The LSN is valid so long as it is behind the current LSN. If it isn't, this > * means that the next log record that includes this metadata could have a > -- > 2.28.0 >
On Thu, Jan 28, 2021 at 10:07:41AM -0500, Brian Foster wrote: > FYI, I finally got to reading your prior log cache flushing thread > yesterday afternoon. I was planning to revisit that and probably reply > this morning after having some time to digest, but saw this and so will > reply here.. > > On Thu, Jan 28, 2021 at 03:41:51PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > To allow for iclog IO device cache flush behaviour to be optimised, > > we first need to separate out the commit record iclog IO from the > > rest of the checkpoint so we can wait for the checkpoint IO to > > complete before we issue the commit record. > > > > This separate is only necessary if the commit record is being > > written into a different iclog to the start of the checkpoint. If > > the entire checkpoint and commit is in the one iclog, then they are > > both covered by the one set of cache flush primitives on the iclog > > and hence there is no need to separate them. > > > > I find the description here a bit vague.. we have to separate out the > commit record I/O, but only if it's already separate..? Yes, because if the commit record is in a different iclog to the start of the checkpoint, we have to ensure that the start of the checkpoint has been written and completed before we issue a PREFLUSH, otherwise the cache flush is not guaranteed to capture that IO. i.e. the block layer cache flushing mechanisms do not order against IOs in flight, only completed IOs. Hence if the commit record is in a different iclog, we have to wait for the other iclogs to complete before flushign the cache. If the checkpoint is wholly contained within a single iclog, then we don't have to wait for anything, the single journal IO has all the cache flushes we need. > Glancing at the > code, I don't see any new "separation" happening, so it's not clear to > me what that really refers to. This looks more like we're introducing > serialization to provide checkpoint -> commit record ordering (when the > commit record happens to be in a separate iclog). Yes, that's exactly the separation that is being done by the call to xlog_wait_on_iclog_lsn() from the CIL push code once we hold a reference to the commit iclog... > > > Otherwise, we need to wait for all the previous iclogs to complete > > so they are ordered correctly and made stable by the REQ_PREFLUSH > > that the commit record iclog IO issues. This guarantees that if a > > reader sees the commit record in the journal, they will also see the > > entire checkpoint that commit record closes off. > > > > This also provides the guarantee that when the commit record IO > > completes, we can safely unpin all the log items in the checkpoint > > so they can be written back because the entire checkpoint is stable > > in the journal. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_log.c | 47 +++++++++++++++++++++++++++++++++++++++++++ > > fs/xfs/xfs_log_cil.c | 7 +++++++ > > fs/xfs/xfs_log_priv.h | 2 ++ > > 3 files changed, 56 insertions(+) > > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > > index c5f507c24577..c5e3da23961c 100644 > > --- a/fs/xfs/xfs_log.c > > +++ b/fs/xfs/xfs_log.c > > @@ -808,6 +808,53 @@ xlog_wait_on_iclog( > > return 0; > > } > > > > +/* > > + * Wait on any iclogs that are still flushing in the range of start_lsn to > > + * the current iclog's lsn. The caller holds a reference to the iclog, but > > + * otherwise holds no log locks. > > + * > > + * We walk backwards through the iclogs to find the iclog with the highest lsn > > + * in the range that we need to wait for and then wait for it to complete. > > + * Completion ordering of iclog IOs ensures that all prior iclogs to this IO are > > + * complete by the time our candidate has completed. > > + */ > > +int > > +xlog_wait_on_iclog_lsn( > > + struct xlog_in_core *iclog, > > + xfs_lsn_t start_lsn) > > +{ > > + struct xlog *log = iclog->ic_log; > > + struct xlog_in_core *prev; > > + int error = -EIO; > > + > > + spin_lock(&log->l_icloglock); > > + if (XLOG_FORCED_SHUTDOWN(log)) > > + goto out_unlock; > > + > > + error = 0; > > + for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) { > > + > > + /* Done if the lsn is before our start lsn */ > > + if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn), > > + start_lsn) < 0) > > + break; > > Hmm.. that logic looks a bit dodgy when you consider that the iclog > header lsn is reset to zero on activation. I think it actually works as > intended because iclog completion reactivates iclogs in LSN order and > this loop walks in reverse order, but that is a very subtle connection > that might be useful to document. It is documented in the comment above the function "We walk backwards through the iclogs to find the iclog....". A newly activated iclog will have an LSN of zero, and that means there are no prior iclogs in the list in the range we need to flush because of the completion ordering guarantees we have for iclog IO (i.e. they are always completed in ring order, not IO completion order). > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > index b0ef071b3cb5..c5cc1b7ad25e 100644 > > --- a/fs/xfs/xfs_log_cil.c > > +++ b/fs/xfs/xfs_log_cil.c > > @@ -870,6 +870,13 @@ xlog_cil_push_work( > > wake_up_all(&cil->xc_commit_wait); > > spin_unlock(&cil->xc_push_lock); > > > > + /* > > + * If the checkpoint spans multiple iclogs, wait for all previous > > + * iclogs to complete before we submit the commit_iclog. > > + */ > > + if (ctx->start_lsn != commit_lsn) > > + xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn); > > + > > This is an interesting serialization point because we don't necessarily > submit the iclog that holds the commit record. I actually think in most > cases the separate commit record won't land on disk until the next CIL > push causes a sync of the current head iclog (assuming it fills it up). Which is exactly what happens now. It's not a big deal, because if we are running a sync transaction, the log force will push it to disk and wait for it. CIL pushes are not synchronous and never have been - the log force is what creates the appearance of synchronous behaviour to the caller... > Granted, this is the last point we have context of the commit record > being written, so it makes sense from an opportunistic standpoint to > serialize here (just as we already block to ensure commit record > ordering across checkpoints). That said, with the aggressive batching > ability of the CIL I think blocking on prior pushes is potentially less > heavy handed than unconditionally blocking on all prior iclog I/O. I > wonder if this might be something to address if we're going to compound > this path with more serialization. This adds ordering to a CIL push, not serialisation across pushes. The CIL pipeline at this point is running fully asynchronously to user operations and other pushes. The only serialisation point with user operations is a log force as we've already ordered the commit record in the iclogs against other CIL flushes that haven't completed. i.e. there's nothing here to serialise against other CIL operations in progress - pushes or commits - it's just internal push IO ordering that we are waiting on here. > From a performance perspective, it seems like this makes the CIL push > _nearly_ synchronous by default. Nope, not at all. There is nothing waiting on the CIL push to complete, except maybe a log force waiting for the iclogs to be synced to disk and completed. > I.e., if we have a several hundred MB > CIL context, Not possible. A 2GB log has a hard throttle limit now of 64MB. > we're going to wait for all but the final iclog to complete > before we return. Of course we'll be waiting for much of that anyways to > reuse limited iclog space, but I have to think more about what that > means in general (maybe not much) and get through the subsequent > patches. In the meantime, have you put thought into any potential > negative performance impact from this that might or might not be offset > by subsequent flush optimizations? This isn't a stand-alone change - it's only separated out from the cache flush changes to make them easier to see. I do not expect there to be any performance degradation due to this change because of the fact that the CIL will only block things waiting on a synchronous log force at this point. In the case where there are lots of sync log forces, checkpoints will be small and single iclogs, so no change of behaviour. For checkpoints that are large, the reduction in IO latency and the ability for the block layer to now merge adjacent journal IO because they don't require FUA or cache flushes more than makes up for any additional latency the ordering might introduce. Cheers, Dave.
On Fri, Jan 29, 2021 at 09:58:51AM -0500, Brian Foster wrote: > On Fri, Jan 29, 2021 at 08:22:21AM +1100, Dave Chinner wrote: > > On Thu, Jan 28, 2021 at 10:07:41AM -0500, Brian Foster wrote: > > > FYI, I finally got to reading your prior log cache flushing thread > > > yesterday afternoon. I was planning to revisit that and probably reply > > > this morning after having some time to digest, but saw this and so will > > > reply here.. > > > > > > On Thu, Jan 28, 2021 at 03:41:51PM +1100, Dave Chinner wrote: > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > To allow for iclog IO device cache flush behaviour to be optimised, > > > > we first need to separate out the commit record iclog IO from the > > > > rest of the checkpoint so we can wait for the checkpoint IO to > > > > complete before we issue the commit record. > > > > > > > > This separate is only necessary if the commit record is being > > > > written into a different iclog to the start of the checkpoint. If > > > > the entire checkpoint and commit is in the one iclog, then they are > > > > both covered by the one set of cache flush primitives on the iclog > > > > and hence there is no need to separate them. > > > > > > > > > > I find the description here a bit vague.. we have to separate out the > > > commit record I/O, but only if it's already separate..? > > > > Yes, because if the commit record is in a different iclog to the > > start of the checkpoint, we have to ensure that the start of the > > checkpoint has been written and completed before we issue a > > PREFLUSH, otherwise the cache flush is not guaranteed to capture > > that IO. i.e. the block layer cache flushing mechanisms do not order > > against IOs in flight, only completed IOs. Hence if the commit > > record is in a different iclog, we have to wait for the other iclogs > > to complete before flushign the cache. If the checkpoint is wholly > > contained within a single iclog, then we don't have to wait for > > anything, the single journal IO has all the cache flushes we need. > > > > Yeah, I get the functional behavior. I think the commit log would do > better to refer to serializing or ordering the commit record (when split > or separate) as opposed to separating it. Except we already have "ordering the commit record" functionality implemented in the CIL push code, where it explicitly orders the commit record for the push against the commit records of other pushes in progress. SO I don't think naming it "ordering the commit record" improves the situation because it introduces ambiguity into what that means. It's also not serialisation... > > > > +int > > > > +xlog_wait_on_iclog_lsn( > > > > + struct xlog_in_core *iclog, > > > > + xfs_lsn_t start_lsn) > > > > +{ > > > > + struct xlog *log = iclog->ic_log; > > > > + struct xlog_in_core *prev; > > > > + int error = -EIO; > > > > + > > > > + spin_lock(&log->l_icloglock); > > > > + if (XLOG_FORCED_SHUTDOWN(log)) > > > > + goto out_unlock; > > > > + > > > > + error = 0; > > > > + for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) { > > > > + > > > > + /* Done if the lsn is before our start lsn */ > > > > + if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn), > > > > + start_lsn) < 0) > > > > + break; > > > > > > Hmm.. that logic looks a bit dodgy when you consider that the iclog > > > header lsn is reset to zero on activation. I think it actually works as > > > intended because iclog completion reactivates iclogs in LSN order and > > > this loop walks in reverse order, but that is a very subtle connection > > > that might be useful to document. > > > > It is documented in the comment above the function "We walk > > backwards through the iclogs to find the iclog....". A newly > > activated iclog will have an LSN of zero, and that means there are > > no prior iclogs in the list in the range we need to flush because of > > the completion ordering guarantees we have for iclog IO (i.e. they > > are always completed in ring order, not IO completion order). > > > > Can you update the comment with that sentence to be more explicit about > the zero LSN case? > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > > > index b0ef071b3cb5..c5cc1b7ad25e 100644 > > > > --- a/fs/xfs/xfs_log_cil.c > > > > +++ b/fs/xfs/xfs_log_cil.c > > > > @@ -870,6 +870,13 @@ xlog_cil_push_work( > > > > wake_up_all(&cil->xc_commit_wait); > > > > spin_unlock(&cil->xc_push_lock); > > > > > > > > + /* > > > > + * If the checkpoint spans multiple iclogs, wait for all previous > > > > + * iclogs to complete before we submit the commit_iclog. > > > > + */ > > > > + if (ctx->start_lsn != commit_lsn) > > > > + xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn); > > > > + > > > > > > This is an interesting serialization point because we don't necessarily > > > submit the iclog that holds the commit record. I actually think in most > > > cases the separate commit record won't land on disk until the next CIL > > > push causes a sync of the current head iclog (assuming it fills it up). > > > > Which is exactly what happens now. It's not a big deal, because if > > we are running a sync transaction, the log force will push it to > > disk and wait for it. CIL pushes are not synchronous and never have > > been - the log force is what creates the appearance of synchronous > > behaviour to the caller... > > > > Right, but otherwise the CIL push blocks on previous checkpoints (and > now the majority of the current checkpoint) only to write the commit > record into an iclog and probably not submit it until the next CIL > checkpoint. That's exactly the same behaviour we have now. Nothing guarantees that the iclog containing a commit record is written immediately. So this simply isn't a concern because we are not changing behaviour. > I agree it's probably not a big deal. The same behavior > likely repeats for subsequent CIL pushes, so it may not even be > noticeable. This is just an observation that we're compounding more > serialization This does not add serialisation! Serialisation means you do something that will stall *other tasks* because they have to wait for the current task to finish before they can proceed. Adding an _ordering delay_ into an unserialised task does not create new serialisation points, it just lengthens the runtime of that task. > on another bit of a historical impedance mismatch between > delayed logging and the underlying layer. We've potentially just wrote > out a large context, yet none of the attached items can be processed > until another checkpoint occurs because the callbacks that do AIL > insertion and whatnot don't run until the iclog with the commit record > hits the disk. Yup, classic two phase commit algorithm behaviour right there. We already have this occurring and it most definitely is not a problem that we need to solve. Indeed, this behaviour is the prime reason for log covering to exist... > This isn't necessarily a commentary on this patch, but it would be nice > if we could start thinking about handling that situation better; > particularly if we've just blocked waiting for the majority of that > checkpoint I/O to complete. The thing is, we are already blocking waiting for iclog IO to complete. We're doing it at a rate of at lease 1 in every 4 iclogs we need to fill in long running checkpoints. So the CIL push that needs to write a thousand iclogs before the commit record is written has already blocked on iclog IO completion at least 250 times. In comaprison, with the cache flush changes and this commit record separation, I'm seeing the iclog block rate drop to 1 in 100 iclogs written. The fact is that cache flushing is -far more expensive- and results is far more blocking and IO completion waiting in the CIL push than this code that orders the commit record by waiting for iclog completion to occur. CIL checkpoint submission is far more IO efficient and *much lower latency* with this single software level ordering delay than using hardware cache flushes to guarantee ordering. To focus on "this new ordering delay might add latency" completely misses the bigger picture that we just replaced 2-3 orders of magnitude of IO blocking delays with a single ordering delay. That's what's important here - it's not that we've added a single ordering delay into the checkpoint, but that we've removed *hundreds* of IO blocking delays in the checkpoint writing process. > That means there's an increased chance that > the next iclog in the ring may be active. Perhaps we could introduce > some logic to switch out the commit record iclog before the CIL push > returns in those particular cases. For example, switch out if the > current checkpoint hit multiple iclogs..? Or hit multiple iclogs and the > next iclog in the ring is clean..? We could just call xlog_state_switch_iclogs() to mark it WANT_SYNC, but then we can't aggregate more changes into it and fill it up. If someone starts waiting on that iclog (i.e. a log force), then it immediately gets marked WANT_SYNC and submitted to disk when it is released. But if there is no-one waiting on it, then we largely don't care if an asynchronous checkpoint is committed immediately, at the start of the next checkpoint, or at worst, within 30s when the log worker next kicks a log force.... It's just not necessary to ensure the commit record hits the disk with minimal latency... > > > From a performance perspective, it seems like this makes the CIL push > > > _nearly_ synchronous by default. > > > > Nope, not at all. There is nothing waiting on the CIL push to > > complete, except maybe a log force waiting for the iclogs to be > > synced to disk and completed. > > > > Right, that's sort of the point... we have various places that issue > async log forces. These async log forces push the CIL, which > historically only ever blocks on log I/O by virtue of waiting on iclog > space. IOW, there is a bit of runway between an async CIL push blocking > on I/O or not provided by the iclog mechanism. > > With this patch, a CIL push (and thus an async log force) now does not > return until all but the commit record iclog have been submitted for I/O > and completed. Same as right now. I have not changed anything. > The log force will then submit the commit record log, but > then not wait on it. This is obviously not a functional problem since a > CIL push was never guaranteed to not block at all (iclogs are a limited > resource), but there's clearly a change in behavior here worth reasoning > about for async forces.. No, there is no change in behaviour at all. The log force in both cases submits the commit record. > > > I.e., if we have a several hundred MB > > > CIL context, > > > > Not possible. A 2GB log has a hard throttle limit now of 64MB. > > > > Well, I don't think the throttle guarantees an upper bound on the > context size, but that's a different topic. Sure, it's a bit lazy. But once you capture the user tasks that run commits on the throttle, the size stops growing. > That was just an example of > a largish sized checkpoint. 64MB is still plenty enough to require > multiple passes through the set of iclogs. Regardless, I think the > before and after change is not significant given that a 64MB checkpoint > already requires internal blocking to reuse iclog space as the > checkpoint flushes anyways. So it's really just (l_iclog_bufs - 1) of > additional I/O we're waiting for in the worst case. I put the metrics that prove this assertion wrong are in the commit message of the next patch. i.e. I measured how many "noiclog" events are occuring during CIL checkpoints. I mentioned that above already - reducing the IO blocking rate from hundreds of events per checkpoint down to a small handful is a major win.... Cheers, Dave.
On 28 Jan 2021 at 10:11, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > To allow for iclog IO device cache flush behaviour to be optimised, > we first need to separate out the commit record iclog IO from the > rest of the checkpoint so we can wait for the checkpoint IO to > complete before we issue the commit record. > > This separate is only necessary if the commit record is being > written into a different iclog to the start of the checkpoint. If > the entire checkpoint and commit is in the one iclog, then they are > both covered by the one set of cache flush primitives on the iclog > and hence there is no need to separate them. > > Otherwise, we need to wait for all the previous iclogs to complete > so they are ordered correctly and made stable by the REQ_PREFLUSH > that the commit record iclog IO issues. This guarantees that if a > reader sees the commit record in the journal, they will also see the > entire checkpoint that commit record closes off. > > This also provides the guarantee that when the commit record IO > completes, we can safely unpin all the log items in the checkpoint > so they can be written back because the entire checkpoint is stable > in the journal. > W.r.t correctness of the changes in this patch, Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log.c | 47 +++++++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_log_cil.c | 7 +++++++ > fs/xfs/xfs_log_priv.h | 2 ++ > 3 files changed, 56 insertions(+) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index c5f507c24577..c5e3da23961c 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -808,6 +808,53 @@ xlog_wait_on_iclog( > return 0; > } > > +/* > + * Wait on any iclogs that are still flushing in the range of start_lsn to > + * the current iclog's lsn. The caller holds a reference to the iclog, but > + * otherwise holds no log locks. > + * > + * We walk backwards through the iclogs to find the iclog with the highest lsn > + * in the range that we need to wait for and then wait for it to complete. > + * Completion ordering of iclog IOs ensures that all prior iclogs to this IO are > + * complete by the time our candidate has completed. > + */ > +int > +xlog_wait_on_iclog_lsn( > + struct xlog_in_core *iclog, > + xfs_lsn_t start_lsn) > +{ > + struct xlog *log = iclog->ic_log; > + struct xlog_in_core *prev; > + int error = -EIO; > + > + spin_lock(&log->l_icloglock); > + if (XLOG_FORCED_SHUTDOWN(log)) > + goto out_unlock; > + > + error = 0; > + for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) { > + > + /* Done if the lsn is before our start lsn */ > + if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn), > + start_lsn) < 0) > + break; > + > + /* Don't need to wait on completed, clean iclogs */ > + if (prev->ic_state == XLOG_STATE_DIRTY || > + prev->ic_state == XLOG_STATE_ACTIVE) { > + continue; > + } > + > + /* wait for completion on this iclog */ > + xlog_wait(&prev->ic_force_wait, &log->l_icloglock); > + return 0; > + } > + > +out_unlock: > + spin_unlock(&log->l_icloglock); > + return error; > +} > + > /* > * Write out an unmount record using the ticket provided. We have to account for > * the data space used in the unmount ticket as this write is not done from a > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index b0ef071b3cb5..c5cc1b7ad25e 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -870,6 +870,13 @@ xlog_cil_push_work( > wake_up_all(&cil->xc_commit_wait); > spin_unlock(&cil->xc_push_lock); > > + /* > + * If the checkpoint spans multiple iclogs, wait for all previous > + * iclogs to complete before we submit the commit_iclog. > + */ > + if (ctx->start_lsn != commit_lsn) > + xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn); > + > /* release the hounds! */ > xfs_log_release_iclog(commit_iclog); > return; > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index 037950cf1061..a7ac85aaff4e 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -584,6 +584,8 @@ xlog_wait( > remove_wait_queue(wq, &wait); > } > > +int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn); > + > /* > * The LSN is valid so long as it is behind the current LSN. If it isn't, this > * means that the next log record that includes this metadata could have a
On Thu, Jan 28, 2021 at 03:41:51PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > To allow for iclog IO device cache flush behaviour to be optimised, > we first need to separate out the commit record iclog IO from the > rest of the checkpoint so we can wait for the checkpoint IO to > complete before we issue the commit record. > > This separate is only necessary if the commit record is being s/separate/separation/g > written into a different iclog to the start of the checkpoint. If > the entire checkpoint and commit is in the one iclog, then they are > both covered by the one set of cache flush primitives on the iclog > and hence there is no need to separate them. > > Otherwise, we need to wait for all the previous iclogs to complete > so they are ordered correctly and made stable by the REQ_PREFLUSH > that the commit record iclog IO issues. This guarantees that if a > reader sees the commit record in the journal, they will also see the > entire checkpoint that commit record closes off. > > This also provides the guarantee that when the commit record IO > completes, we can safely unpin all the log items in the checkpoint > so they can be written back because the entire checkpoint is stable > in the journal. I'm a little worried about the direction for devices without a volatile write cache like all highend enterprise SSDs, Arrays and hard drives, where we not introduce another synchronization point without any gains from the reduction in FUA/flush traffic that is a no-op there.
On Sat, Jan 30, 2021 at 09:25:59AM +1100, Dave Chinner wrote: > On Fri, Jan 29, 2021 at 09:58:51AM -0500, Brian Foster wrote: > > On Fri, Jan 29, 2021 at 08:22:21AM +1100, Dave Chinner wrote: > > > On Thu, Jan 28, 2021 at 10:07:41AM -0500, Brian Foster wrote: > > > > FYI, I finally got to reading your prior log cache flushing thread > > > > yesterday afternoon. I was planning to revisit that and probably reply > > > > this morning after having some time to digest, but saw this and so will > > > > reply here.. > > > > > > > > On Thu, Jan 28, 2021 at 03:41:51PM +1100, Dave Chinner wrote: > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > > To allow for iclog IO device cache flush behaviour to be optimised, > > > > > we first need to separate out the commit record iclog IO from the > > > > > rest of the checkpoint so we can wait for the checkpoint IO to > > > > > complete before we issue the commit record. > > > > > > > > > > This separate is only necessary if the commit record is being > > > > > written into a different iclog to the start of the checkpoint. If > > > > > the entire checkpoint and commit is in the one iclog, then they are > > > > > both covered by the one set of cache flush primitives on the iclog > > > > > and hence there is no need to separate them. > > > > > > > > > > > > > I find the description here a bit vague.. we have to separate out the > > > > commit record I/O, but only if it's already separate..? > > > > > > Yes, because if the commit record is in a different iclog to the > > > start of the checkpoint, we have to ensure that the start of the > > > checkpoint has been written and completed before we issue a > > > PREFLUSH, otherwise the cache flush is not guaranteed to capture > > > that IO. i.e. the block layer cache flushing mechanisms do not order > > > against IOs in flight, only completed IOs. Hence if the commit > > > record is in a different iclog, we have to wait for the other iclogs > > > to complete before flushign the cache. If the checkpoint is wholly > > > contained within a single iclog, then we don't have to wait for > > > anything, the single journal IO has all the cache flushes we need. > > > > > > > Yeah, I get the functional behavior. I think the commit log would do > > better to refer to serializing or ordering the commit record (when split > > or separate) as opposed to separating it. > > Except we already have "ordering the commit record" functionality > implemented in the CIL push code, where it explicitly orders the > commit record for the push against the commit records of other > pushes in progress. > > SO I don't think naming it "ordering the commit record" improves the > situation because it introduces ambiguity into what that means. It's > also not serialisation... > I'm not sure we can get much more vague than "separate commit record IO." :P We can fairly easily distinguish between ordering of independent commit records and ordering a particular commit record against log I/O of its associated checkpoint. For example: xfs: block commit record iclog submission on previous iclog completion ... but feel free to come up with something better, use the term ordering instead of blocking, etc. > > > > > +int > > > > > +xlog_wait_on_iclog_lsn( > > > > > + struct xlog_in_core *iclog, > > > > > + xfs_lsn_t start_lsn) > > > > > +{ > > > > > + struct xlog *log = iclog->ic_log; > > > > > + struct xlog_in_core *prev; > > > > > + int error = -EIO; > > > > > + > > > > > + spin_lock(&log->l_icloglock); > > > > > + if (XLOG_FORCED_SHUTDOWN(log)) > > > > > + goto out_unlock; > > > > > + > > > > > + error = 0; > > > > > + for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) { > > > > > + > > > > > + /* Done if the lsn is before our start lsn */ > > > > > + if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn), > > > > > + start_lsn) < 0) > > > > > + break; > > > > > > > > Hmm.. that logic looks a bit dodgy when you consider that the iclog > > > > header lsn is reset to zero on activation. I think it actually works as > > > > intended because iclog completion reactivates iclogs in LSN order and > > > > this loop walks in reverse order, but that is a very subtle connection > > > > that might be useful to document. > > > > > > It is documented in the comment above the function "We walk > > > backwards through the iclogs to find the iclog....". A newly > > > activated iclog will have an LSN of zero, and that means there are > > > no prior iclogs in the list in the range we need to flush because of > > > the completion ordering guarantees we have for iclog IO (i.e. they > > > are always completed in ring order, not IO completion order). > > > > > > > Can you update the comment with that sentence to be more explicit about > > the zero LSN case? > > > > > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > > > > index b0ef071b3cb5..c5cc1b7ad25e 100644 > > > > > --- a/fs/xfs/xfs_log_cil.c > > > > > +++ b/fs/xfs/xfs_log_cil.c > > > > > @@ -870,6 +870,13 @@ xlog_cil_push_work( > > > > > wake_up_all(&cil->xc_commit_wait); > > > > > spin_unlock(&cil->xc_push_lock); > > > > > > > > > > + /* > > > > > + * If the checkpoint spans multiple iclogs, wait for all previous > > > > > + * iclogs to complete before we submit the commit_iclog. > > > > > + */ > > > > > + if (ctx->start_lsn != commit_lsn) > > > > > + xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn); > > > > > + > > > > > > > > This is an interesting serialization point because we don't necessarily > > > > submit the iclog that holds the commit record. I actually think in most > > > > cases the separate commit record won't land on disk until the next CIL > > > > push causes a sync of the current head iclog (assuming it fills it up). > > > > > > Which is exactly what happens now. It's not a big deal, because if > > > we are running a sync transaction, the log force will push it to > > > disk and wait for it. CIL pushes are not synchronous and never have > > > been - the log force is what creates the appearance of synchronous > > > behaviour to the caller... > > > > > > > Right, but otherwise the CIL push blocks on previous checkpoints (and > > now the majority of the current checkpoint) only to write the commit > > record into an iclog and probably not submit it until the next CIL > > checkpoint. > > That's exactly the same behaviour we have now. Nothing guarantees > that the iclog containing a commit record is written immediately. > So this simply isn't a concern because we are not changing > behaviour. > > > I agree it's probably not a big deal. The same behavior > > likely repeats for subsequent CIL pushes, so it may not even be > > noticeable. This is just an observation that we're compounding more > > serialization > > This does not add serialisation! > > Serialisation means you do something that will stall *other tasks* > because they have to wait for the current task to finish before they > can proceed. Adding an _ordering delay_ into an unserialised task > does not create new serialisation points, it just lengthens the > runtime of that task. > Perhaps serialization is not the right term... I/O synchronization perhaps? I don't know, but calling it an ordering delay is certainly not comprehensive because it has functional ramifications.. > > on another bit of a historical impedance mismatch between > > delayed logging and the underlying layer. We've potentially just wrote > > out a large context, yet none of the attached items can be processed > > until another checkpoint occurs because the callbacks that do AIL > > insertion and whatnot don't run until the iclog with the commit record > > hits the disk. > > Yup, classic two phase commit algorithm behaviour right there. We > already have this occurring and it most definitely is not a problem > that we need to solve. Indeed, this behaviour is the prime reason > for log covering to exist... > > > This isn't necessarily a commentary on this patch, but it would be nice > > if we could start thinking about handling that situation better; > > particularly if we've just blocked waiting for the majority of that > > checkpoint I/O to complete. > > The thing is, we are already blocking waiting for iclog IO to > complete. We're doing it at a rate of at lease 1 in every 4 iclogs > we need to fill in long running checkpoints. So the CIL push that > needs to write a thousand iclogs before the commit record is written > has already blocked on iclog IO completion at least 250 times. > > In comaprison, with the cache flush changes and this commit record > separation, I'm seeing the iclog block rate drop to 1 in 100 iclogs > written. The fact is that cache flushing is -far more expensive- and > results is far more blocking and IO completion waiting in the CIL > push than this code that orders the commit record by waiting for > iclog completion to occur. > What exactly is "iclog block rate?" Does the calculation of the new rate factor in the implicit iclog blocking we now have by default? > CIL checkpoint submission is far more IO efficient and *much lower > latency* with this single software level ordering delay than using > hardware cache flushes to guarantee ordering. To focus on "this > new ordering delay might add latency" completely misses the bigger > picture that we just replaced 2-3 orders of magnitude of IO blocking > delays with a single ordering delay. > > That's what's important here - it's not that we've added a single > ordering delay into the checkpoint, but that we've removed > *hundreds* of IO blocking delays in the checkpoint writing process. > Ok. I think Christoph raised a good point around storage without volatile caches, but otherwise I find the performance tradeoff analysis reasonable for the applicable case. Thanks. > > That means there's an increased chance that > > the next iclog in the ring may be active. Perhaps we could introduce > > some logic to switch out the commit record iclog before the CIL push > > returns in those particular cases. For example, switch out if the > > current checkpoint hit multiple iclogs..? Or hit multiple iclogs and the > > next iclog in the ring is clean..? > > We could just call xlog_state_switch_iclogs() to mark it WANT_SYNC, > but then we can't aggregate more changes into it and fill it up. If > someone starts waiting on that iclog (i.e. a log force), then it > immediately gets marked WANT_SYNC and submitted to disk when it is > released. But if there is no-one waiting on it, then we largely > don't care if an asynchronous checkpoint is committed immediately, > at the start of the next checkpoint, or at worst, within 30s when > the log worker next kicks a log force.... > > It's just not necessary to ensure the commit record hits the disk > with minimal latency... > > > > > From a performance perspective, it seems like this makes the CIL push > > > > _nearly_ synchronous by default. > > > > > > Nope, not at all. There is nothing waiting on the CIL push to > > > complete, except maybe a log force waiting for the iclogs to be > > > synced to disk and completed. > > > > > > > Right, that's sort of the point... we have various places that issue > > async log forces. These async log forces push the CIL, which > > historically only ever blocks on log I/O by virtue of waiting on iclog > > space. IOW, there is a bit of runway between an async CIL push blocking > > on I/O or not provided by the iclog mechanism. > > > > With this patch, a CIL push (and thus an async log force) now does not > > return until all but the commit record iclog have been submitted for I/O > > and completed. > > Same as right now. I have not changed anything. > > > The log force will then submit the commit record log, but > > then not wait on it. This is obviously not a functional problem since a > > CIL push was never guaranteed to not block at all (iclogs are a limited > > resource), but there's clearly a change in behavior here worth reasoning > > about for async forces.. > > No, there is no change in behaviour at all. The log force in both > cases submits the commit record. > But it also blocks where it didn't before. I'm not sure why I can't seem to get this point across. Having thought more about this, here is a more concrete example of the behavior change wrt to async log forces: - Various operations populate the CIL, eventually triggers background CIL push. - Background CIL push writes out the majority of checkpoint A, writes the commit record of checkpoint A into the current head iclog and returns without submitting it for I/O. - More transactions commit and some task issues an async log force (i.e. xfs_log_force(.., 0);): - xlog_cil_push_work() submits the iclog with the commit record for checkpoint A as it writes out checkpoint B. - xlog_wait_on_iclog_lsn() waits on ->ic_force_wait of the commit record iclog for checkpoint A. ... At this point, we've blocked async log force on completion processing of a previous checkpoint because ->ic_force_wait isn't woken until the iclog is cleaned, which doesn't occur until completion callbacks attached to the iclog complete. This means we've technically just changed the locking semantics for async log forces as callers must be cognizant of locks used in the log I/O completion path. Is this a problem? I don't know. Perhaps not, but we've had similar issues in the past[1] in cases where an async log force unexpectedly blocked on callback processing. The example cited was a different waitqueue associated with the CIL context as well as being a shutdown case that I think is no longer relevant due to more recent inode flush reworks, etc., but it's just an example that raises concern. So unless I'm missing something in the sequence above, I think this requires more audit and analysis than we currently have for this particular patch. I.e., we still have oddball cases like how xfs_buf_lock() of a stale buffer invokes an async log force (while potentially holding other locked items/buffers in a transaction) while similarly an unpin of a staled buffer can cycle the buffer lock on the log I/O completion side. Is that safe or a rathole of landmines (now and for future code)? Perhaps a more robust option is to create a new waitqueue specifically for the purpose of blocking iclog I/O so historical async force rules/behavior is preserved. (We do have ->ic_write_wait, though that doesn't appear to provide ordering..). Brian [1] 545aa41f5cba ("xfs: wake commit waiters on CIL abort before log item abort") > > > > I.e., if we have a several hundred MB > > > > CIL context, > > > > > > Not possible. A 2GB log has a hard throttle limit now of 64MB. > > > > > > > Well, I don't think the throttle guarantees an upper bound on the > > context size, but that's a different topic. > > Sure, it's a bit lazy. But once you capture the user tasks that run > commits on the throttle, the size stops growing. > > > That was just an example of > > a largish sized checkpoint. 64MB is still plenty enough to require > > multiple passes through the set of iclogs. Regardless, I think the > > before and after change is not significant given that a 64MB checkpoint > > already requires internal blocking to reuse iclog space as the > > checkpoint flushes anyways. So it's really just (l_iclog_bufs - 1) of > > additional I/O we're waiting for in the worst case. > > I put the metrics that prove this assertion wrong are in the commit > message of the next patch. i.e. I measured how many "noiclog" events > are occuring during CIL checkpoints. I mentioned that above already > - reducing the IO blocking rate from hundreds of events per > checkpoint down to a small handful is a major win.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index c5f507c24577..c5e3da23961c 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -808,6 +808,53 @@ xlog_wait_on_iclog( return 0; } +/* + * Wait on any iclogs that are still flushing in the range of start_lsn to + * the current iclog's lsn. The caller holds a reference to the iclog, but + * otherwise holds no log locks. + * + * We walk backwards through the iclogs to find the iclog with the highest lsn + * in the range that we need to wait for and then wait for it to complete. + * Completion ordering of iclog IOs ensures that all prior iclogs to this IO are + * complete by the time our candidate has completed. + */ +int +xlog_wait_on_iclog_lsn( + struct xlog_in_core *iclog, + xfs_lsn_t start_lsn) +{ + struct xlog *log = iclog->ic_log; + struct xlog_in_core *prev; + int error = -EIO; + + spin_lock(&log->l_icloglock); + if (XLOG_FORCED_SHUTDOWN(log)) + goto out_unlock; + + error = 0; + for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) { + + /* Done if the lsn is before our start lsn */ + if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn), + start_lsn) < 0) + break; + + /* Don't need to wait on completed, clean iclogs */ + if (prev->ic_state == XLOG_STATE_DIRTY || + prev->ic_state == XLOG_STATE_ACTIVE) { + continue; + } + + /* wait for completion on this iclog */ + xlog_wait(&prev->ic_force_wait, &log->l_icloglock); + return 0; + } + +out_unlock: + spin_unlock(&log->l_icloglock); + return error; +} + /* * Write out an unmount record using the ticket provided. We have to account for * the data space used in the unmount ticket as this write is not done from a diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index b0ef071b3cb5..c5cc1b7ad25e 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -870,6 +870,13 @@ xlog_cil_push_work( wake_up_all(&cil->xc_commit_wait); spin_unlock(&cil->xc_push_lock); + /* + * If the checkpoint spans multiple iclogs, wait for all previous + * iclogs to complete before we submit the commit_iclog. + */ + if (ctx->start_lsn != commit_lsn) + xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn); + /* release the hounds! */ xfs_log_release_iclog(commit_iclog); return; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 037950cf1061..a7ac85aaff4e 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -584,6 +584,8 @@ xlog_wait( remove_wait_queue(wq, &wait); } +int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn); + /* * The LSN is valid so long as it is behind the current LSN. If it isn't, this * means that the next log record that includes this metadata could have a