xfs: Wake CIL push waiters more reliably

Message ID	20201229235627.33289-1-buczek@molgen.mpg.de (mailing list archive)
State	Deferred, archived
Headers	show Return-Path: <linux-xfs-owner@kernel.org> From: Donald Buczek <buczek@molgen.mpg.de> To: linux-xfs@vger.kernel.org, Brian Foster <bfoster@redhat.com>, Dave Chinner <david@fromorbit.com> Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, it+linux-xfs@molgen.mpg.de, Donald Buczek <buczek@molgen.mpg.de> Subject: [PATCH] xfs: Wake CIL push waiters more reliably Date: Wed, 30 Dec 2020 00:56:27 +0100 Message-Id: <20201229235627.33289-1-buczek@molgen.mpg.de> In-Reply-To: <1705b481-16db-391e-48a8-a932d1f137e7@molgen.mpg.de> References: <1705b481-16db-391e-48a8-a932d1f137e7@molgen.mpg.de> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	xfs: Wake CIL push waiters more reliably \| expand xfs: Wake CIL push waiters more reliably

Donald Buczek Dec. 29, 2020, 11:56 p.m. UTC

Threads, which committed items to the CIL, wait in the xc_push_wait
waitqueue when used_space in the push context goes over a limit. These
threads need to be woken when the CIL is pushed.

The CIL push worker tries to avoid the overhead of calling wake_all()
when there are no waiters waiting. It does so by checking the same
condition which caused the waits to happen. This, however, is
unreliable, because ctx->space_used can actually decrease when items are
recommitted. If the value goes below the limit while some threads are
already waiting but before the push worker gets to it, these threads are
not woken.

Always wake all CIL push waiters. Test with waitqueue_active() as an
optimization. This is possible, because we hold the xc_push_lock
spinlock, which prevents additions to the waitqueue.

Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
---
 fs/xfs/xfs_log_cil.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Dave Chinner Dec. 30, 2020, 10:16 p.m. UTC | #1

On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> Threads, which committed items to the CIL, wait in the xc_push_wait
> waitqueue when used_space in the push context goes over a limit. These
> threads need to be woken when the CIL is pushed.
> 
> The CIL push worker tries to avoid the overhead of calling wake_all()
> when there are no waiters waiting. It does so by checking the same
> condition which caused the waits to happen. This, however, is
> unreliable, because ctx->space_used can actually decrease when items are
> recommitted.

When does this happen?

Do you have tracing showing the operation where the relogged item
has actually gotten smaller? By definition, relogging in the CIL
should only grow the size of the object in the CIL because it must
relog all the existing changes on top of the new changed being made
to the object. Hence the CIL reservation should only ever grow.

IOWs, returning negative lengths from the formatting code is
unexpected and probably a bug and requires further investigation,
not papering over the occurrence with broadcast wakeups...

> If the value goes below the limit while some threads are
> already waiting but before the push worker gets to it, these threads are
> not woken.
> 
> Always wake all CIL push waiters. Test with waitqueue_active() as an
> optimization. This is possible, because we hold the xc_push_lock
> spinlock, which prevents additions to the waitqueue.
> 
> Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> ---
>  fs/xfs/xfs_log_cil.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..d620de8e217c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -670,7 +670,7 @@ xlog_cil_push_work(
>  	/*
>  	 * Wake up any background push waiters now this context is being pushed.
>  	 */
> -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> +	if (waitqueue_active(&cil->xc_push_wait))
>  		wake_up_all(&cil->xc_push_wait);

That just smells wrong to me. It *might* be correct, but this
condition should pair with the sleep condition, as space used by a
CIL context should never actually decrease....

Cheers,

Dave.

Donald Buczek Dec. 31, 2020, 11:48 a.m. UTC | #2

On 30.12.20 23:16, Dave Chinner wrote:
> On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
>> Threads, which committed items to the CIL, wait in the xc_push_wait
>> waitqueue when used_space in the push context goes over a limit. These
>> threads need to be woken when the CIL is pushed.
>>
>> The CIL push worker tries to avoid the overhead of calling wake_all()
>> when there are no waiters waiting. It does so by checking the same
>> condition which caused the waits to happen. This, however, is
>> unreliable, because ctx->space_used can actually decrease when items are
>> recommitted.
> 
> When does this happen?
> 
> Do you have tracing showing the operation where the relogged item
> has actually gotten smaller? By definition, relogging in the CIL
> should only grow the size of the object in the CIL because it must
> relog all the existing changes on top of the new changed being made
> to the object. Hence the CIL reservation should only ever grow.

I have (very ugly printk based) log (see below), but it only shows, that it happened (space_used decreasing), not what caused it.

I only browsed the ( xfs_*_item.c ) code and got the impression that the size of a log item is rather dynamic (e.g. number of extends in an inode, extended attributes in an inode, continuity of chunks in a buffer) and wasn't surprised that a relogged item might need less space from time to time.

> IOWs, returning negative lengths from the formatting code is
> unexpected and probably a bug and requires further investigation,
> not papering over the occurrence with broadcast wakeups...

One could argue that the code is more robust after the change, because it wakes up every thread which is waiting on the next push to happen when the next push is happening without making assumption of why these threads are waiting by duplicating code from that waiters side. The proposed waitqueue_active() is inlined to two instructions and avoids the call overhead if there are no waiters as well.

But, of course, if the size is not expected to decrease, this discrepancy must be clarified anyway. I am sure, I can find out the exact circumstances, this does happen. I will follow up on this.

Happy New Year!

   Donald

Log extract following....

========

+++ b/fs/xfs/xfs_log_cil.c
@@ -17,6 +17,8 @@
  #include "xfs_log_priv.h"
  #include "xfs_trace.h"
  
+#include <linux/printk.h>
+
  struct workqueue_struct *xfs_discard_wq;
  
  /*
@@ -670,8 +672,25 @@ xlog_cil_push_work(
         /*
          * Wake up any background push waiters now this context is being pushed.
          */
-       if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
+       if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
                 wake_up_all(&cil->xc_push_wait);
+               pr_warn("XXX wake    cil %p ctx %p  ctx->space_used=%d >= %d, push_seq=%lld, ctx->sequence=%lld\n",
+                       cil, ctx,
+                       ctx->space_used,
+                       XLOG_CIL_BLOCKING_SPACE_LIMIT(log),
+                       cil->xc_push_seq,
+                       ctx->sequence);
+       } else {
+               pr_warn("XXX no wake cil %p ctx %p ctx->space_used=%d < %d, push_seq=%lld, ctx->sequence=%lld\n",
+                       cil, ctx,
+                       ctx->space_used,
+                       XLOG_CIL_BLOCKING_SPACE_LIMIT(log),
+                       cil->xc_push_seq,
+                       ctx->sequence);
+               if (waitqueue_active(&cil->xc_push_wait)) {
+                       pr_warn("XXX xc_push_wait ACTIVE!\n");
+               }
+       }
  
         /*
          * Check if we've anything to push. If there is nothing, then we don't
@@ -918,6 +937,11 @@ xlog_cil_push_background(
         spin_lock(&cil->xc_push_lock);
         if (cil->xc_push_seq < cil->xc_current_sequence) {
                 cil->xc_push_seq = cil->xc_current_sequence;
+               pr_warn("XXX trigger cil %p ctx %p  ctx->space_used=%d      , push_seq=%lld, ctx->sequence=%lld\n",
+                       cil, cil->xc_ctx,
+                       cil->xc_ctx->space_used,
+                       cil->xc_push_seq,
+                       cil->xc_ctx->sequence);
                 queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
         }
  
@@ -936,6 +960,12 @@ xlog_cil_push_background(
         if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
                 trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
                 ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+               pr_warn("XXX pushw   cil %p ctx %p  ctx->space_used=%d >= %d, push_seq=%lld, ctx->sequence=%lld\n",
+                       cil, cil->xc_ctx,
+                       cil->xc_ctx->space_used,
+                       XLOG_CIL_BLOCKING_SPACE_LIMIT(log),
+                       cil->xc_push_seq,
+                       cil->xc_ctx->sequence);
                 xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
                 return;
         }

=====

14606 "XXX" lines logged before deadlock
4847 "XXX" lines logged for the deadlocked filesystem (CIL 00000000e374c6f1) before deadlock

relevant lines:

# seq 29

2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.392095+01:00 deadbird kernel: [ 1071.600503] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109068 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.392120+01:00 deadbird kernel: [ 1071.616245] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109108 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.408102+01:00 deadbird kernel: [ 1071.632024] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109108 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.425086+01:00 deadbird kernel: [ 1071.648913] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109108 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.442109+01:00 deadbird kernel: [ 1071.666410] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109108 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.482769+01:00 deadbird kernel: [ 1071.706666] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.501063+01:00 deadbird kernel: [ 1071.725484] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.535711+01:00 deadbird kernel: [ 1071.742926] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.535728+01:00 deadbird kernel: [ 1071.760346] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.552079+01:00 deadbird kernel: [ 1071.776167] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.569070+01:00 deadbird kernel: [ 1071.793576] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
[ 953 more of these ...]
2020-12-29T20:08:50.368192+01:00 deadbird kernel: [ 1088.576346] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:50.385023+01:00 deadbird kernel: [ 1088.593009] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:50.418318+01:00 deadbird kernel: [ 1088.609811] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:50.418330+01:00 deadbird kernel: [ 1088.626431] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29

# seq 30

2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=33554480      , push_seq=30, ctx->sequence=30
2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.076095+01:00 deadbird kernel: [ 1180.283748] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.193070+01:00 deadbird kernel: [ 1180.401590] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.421082+01:00 deadbird kernel: [ 1180.629682] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108908 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.507085+01:00 deadbird kernel: [ 1180.715657] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108892 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.507094+01:00 deadbird kernel: [ 1180.731757] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108876 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.659070+01:00 deadbird kernel: [ 1180.867812] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.771081+01:00 deadbird kernel: [ 1180.980187] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.791116+01:00 deadbird kernel: [ 1180.996535] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:32.512085+01:00 deadbird kernel: [ 1190.725044] XXX no wake cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108856 < 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:32.528119+01:00 deadbird kernel: [ 1190.753321] XXX xc_push_wait ACTIVE!

After this "xc_push_wait ACTIVE!" the I/O to the filesystem ceased.

space_used was reduced from 67108924 .. 67108908 .. 67108892 .. 67108876 .. 67108872 as seen by waiters to 67108856 as seen by the push worker. Presumably by other calls to xfs_log_commit_cil which did not produce log, because they neither triggered the worker nor waitetd for the push.

Dave Chinner Dec. 31, 2020, 9:59 p.m. UTC | #3

On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> On 30.12.20 23:16, Dave Chinner wrote:
> > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > Threads, which committed items to the CIL, wait in the
> > > xc_push_wait waitqueue when used_space in the push context
> > > goes over a limit. These threads need to be woken when the CIL
> > > is pushed.
> > > 
> > > The CIL push worker tries to avoid the overhead of calling
> > > wake_all() when there are no waiters waiting. It does so by
> > > checking the same condition which caused the waits to happen.
> > > This, however, is unreliable, because ctx->space_used can
> > > actually decrease when items are recommitted.
> > 
> > When does this happen?
> > 
> > Do you have tracing showing the operation where the relogged
> > item has actually gotten smaller? By definition, relogging in
> > the CIL should only grow the size of the object in the CIL
> > because it must relog all the existing changes on top of the new
> > changed being made to the object. Hence the CIL reservation
> > should only ever grow.
> 
> I have (very ugly printk based) log (see below), but it only
> shows, that it happened (space_used decreasing), not what caused
> it.
> 
> I only browsed the ( xfs_*_item.c ) code and got the impression
> that the size of a log item is rather dynamic (e.g. number of
> extends in an inode, extended attributes in an inode, continuity
> of chunks in a buffer) and wasn't surprised that a relogged item
> might need less space from time to time.
> 
> > IOWs, returning negative lengths from the formatting code is
> > unexpected and probably a bug and requires further
> > investigation, not papering over the occurrence with broadcast
> > wakeups...
> 
> One could argue that the code is more robust after the change,
> because it wakes up every thread which is waiting on the next push
> to happen when the next push is happening without making
> assumption of why these threads are waiting by duplicating code
> from that waiters side. The proposed waitqueue_active() is inlined
> to two instructions and avoids the call overhead if there are no
> waiters as well.

One could argue that, but one should also understand the design
constraints for a particular algorithm are before suggesting that
their solution is "robust". :)

> 
> # seq 29
> 
> 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29

So, at 20:08:15 we get a push trigger and the work is queued. But...

.....
> 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29

It takes the best part of *50 seconds* before the push work actually
runs?

That's .... well and truly screwed up - the work should run on that
CPU on the very next time it yeilds the CPU. If we're holding the
CPU without yeilding it for that long, hangcheck and RCU warnings
should be going off...

> # seq 30
> 
> 2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=33554480      , push_seq=30, ctx->sequence=30

20:09:39 for the next trigger,

> 2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.076095+01:00 deadbird kernel: [ 1180.283748] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.193070+01:00 deadbird kernel: [ 1180.401590] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.421082+01:00 deadbird kernel: [ 1180.629682] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108908 >= 67108864, push_seq=30, ctx->sequence=30

So it dropped by 16 bytes (seems to be common) which is unexpected.
I wonder if it filled a hole in a buffer and so needed one less
xlog_op_header()? But then the size would have gone up by at least
128 bytes for the hole that was filled, so it still shouldn't go
down in size.

I think you need to instrument xlog_cil_insert_items() and catch
a negative length here:

	/* account for space used by new iovec headers  */
	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
	len += iovhdr_res;
	ctx->nvecs += diff_iovecs;

(diff_iovecs will be negative if the number of xlog_op_header
structures goes down)

And if this happens, then dump the transaction ticket via
xlog_print_trans(tp) so we can see all the log items types and
vectors that the transaction has formatted...

> 2020-12-29T20:10:22.507085+01:00 deadbird kernel: [ 1180.715657] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108892 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.507094+01:00 deadbird kernel: [ 1180.731757] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108876 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.659070+01:00 deadbird kernel: [ 1180.867812] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.771081+01:00 deadbird kernel: [ 1180.980187] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.791116+01:00 deadbird kernel: [ 1180.996535] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:32.512085+01:00 deadbird kernel: [ 1190.725044] XXX no wake cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108856 < 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:32.528119+01:00 deadbird kernel: [ 1190.753321] XXX xc_push_wait ACTIVE!

Also, another 50s hold-off from push work being queued to the work
actually running. That also needs to be understood, because that's
clearly contributing to hitting the hard limit regularly and that
should mostly never happen....

Cheers,

Dave.

Donald Buczek Jan. 2, 2021, 7:12 p.m. UTC | #4

On 31.12.20 22:59, Dave Chinner wrote:

Hey, funny, your email could celebrate New Year a second time :-)

> On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
>> On 30.12.20 23:16, Dave Chinner wrote:
>>> On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
>>>> Threads, which committed items to the CIL, wait in the
>>>> xc_push_wait waitqueue when used_space in the push context
>>>> goes over a limit. These threads need to be woken when the CIL
>>>> is pushed.
>>>>
>>>> The CIL push worker tries to avoid the overhead of calling
>>>> wake_all() when there are no waiters waiting. It does so by
>>>> checking the same condition which caused the waits to happen.
>>>> This, however, is unreliable, because ctx->space_used can
>>>> actually decrease when items are recommitted.
>>>
>>> When does this happen?
>>>
>>> Do you have tracing showing the operation where the relogged
>>> item has actually gotten smaller? By definition, relogging in
>>> the CIL should only grow the size of the object in the CIL
>>> because it must relog all the existing changes on top of the new
>>> changed being made to the object. Hence the CIL reservation
>>> should only ever grow.
>>
>> I have (very ugly printk based) log (see below), but it only
>> shows, that it happened (space_used decreasing), not what caused
>> it.
>>
>> I only browsed the ( xfs_*_item.c ) code and got the impression
>> that the size of a log item is rather dynamic (e.g. number of
>> extends in an inode, extended attributes in an inode, continuity
>> of chunks in a buffer) and wasn't surprised that a relogged item
>> might need less space from time to time.
>>
>>> IOWs, returning negative lengths from the formatting code is
>>> unexpected and probably a bug and requires further
>>> investigation, not papering over the occurrence with broadcast
>>> wakeups...
>>
>> One could argue that the code is more robust after the change,
>> because it wakes up every thread which is waiting on the next push
>> to happen when the next push is happening without making
>> assumption of why these threads are waiting by duplicating code
>> from that waiters side. The proposed waitqueue_active() is inlined
>> to two instructions and avoids the call overhead if there are no
>> waiters as well.
> 
> One could argue that, but one should also understand the design
> constraints for a particular algorithm are before suggesting that
> their solution is "robust". :)

Yes, but an understanding to the extend required by the argument should be sufficient :-)

>> # seq 29
>>
>> 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
> 
> So, at 20:08:15 we get a push trigger and the work is queued. But...
> 
> .....
>> 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
> 
> It takes the best part of *50 seconds* before the push work actually
> runs?
> 
> That's .... well and truly screwed up - the work should run on that
> CPU on the very next time it yeilds the CPU. If we're holding the
> CPU without yeilding it for that long, hangcheck and RCU warnings
> should be going off...

No such warnings.

But the load is probably I/O bound to the log:

- creates `cp -a` copies of a directory tree with small files (linux source repository)
- source tree probably completely cached.
- two copies in parallel
- slow log (on software raid6)

Isn't it to be expected that sooner or later you need to wait for log writes when you write as fast as possible with lots of metadata updates and not so much data?

I'm a bit concerned, though, that there seem to be a rather unlimited (~ 1000) number of kernel worker threads waiting for the cil push and indirectly for log writes.

>> # seq 30
>>
>> 2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=33554480      , push_seq=30, ctx->sequence=30
> 
> 20:09:39 for the next trigger,
> 
>> 2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:22.076095+01:00 deadbird kernel: [ 1180.283748] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:22.193070+01:00 deadbird kernel: [ 1180.401590] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:22.421082+01:00 deadbird kernel: [ 1180.629682] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108908 >= 67108864, push_seq=30, ctx->sequence=30
> 
> So it dropped by 16 bytes (seems to be common) which is unexpected.
> I wonder if it filled a hole in a buffer and so needed one less
> xlog_op_header()? But then the size would have gone up by at least
> 128 bytes for the hole that was filled, so it still shouldn't go
> down in size.
> 
> I think you need to instrument xlog_cil_insert_items() and catch
> a negative length here:
> 
> 	/* account for space used by new iovec headers  */
> 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> 	len += iovhdr_res;
> 	ctx->nvecs += diff_iovecs;
> 
> (diff_iovecs will be negative if the number of xlog_op_header
> structures goes down)
> 
> And if this happens, then dump the transaction ticket via
> xlog_print_trans(tp) so we can see all the log items types and
> vectors that the transaction has formatted...

I tried that, but the output was difficult to understand, because at that point you can only log the complete transaction with the items already updated. And a shrinking item is not switched to the shadow vector, so the formatted content is already overwritten and not available for analysis.

So I've added some code to dump an item in its old and its new state in xlog_cil_insert_format_items when either the requested buffer len or the number of vectors decreased.

Several examples of different kind with a little bit of manual annotation following

Best
   Donald

- XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 32 to 20 bytes

[   29.606212] XFS (sda1): XXX required buf size 184 -> 172
[   29.612591] XFS (sda1): XXX niovecs           3 -> 3

[   29.618570] XFS (sda1): XXX old log item:
[   29.623469] XFS (sda1): log item:
[   29.627683] XFS (sda1):   type       = 0x123b                                               # XFS_LI_INODE
[   29.632375] XFS (sda1):   flags      = 0x8
[   29.636858] XFS (sda1):   niovecs    = 3
[   29.647442] XFS (sda1):   size       = 312
[   29.651814] XFS (sda1):   bytes      = 184
[   29.656278] XFS (sda1):   buf len    = 184
[   29.660927] XFS (sda1):   iovec[0]
[   29.665071] XFS (sda1):     type     = 0x5                                                  # XLOG_REG_TYPE_IFORMAT
[   29.669592] XFS (sda1):     len      = 56
[   29.673914] XFS (sda1):     first 32 bytes of iovec[0]:
[   29.680079] 00000000: 3b 12 03 00 03 00 00 00 00 00 20 00 00 00 00 00  ;......... .....
[   29.689363] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   29.698633] XFS (sda1):   iovec[1]
[   29.702756] XFS (sda1):     type     = 0x6                                                  # XLOG_REG_TYPE_ICORE
[   29.707263] XFS (sda1):     len      = 96
[   29.711571] XFS (sda1):     first 32 bytes of iovec[1]:
[   29.717720] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
[   29.726986] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
[   29.736241] XFS (sda1):   iovec[2]
[   29.740364] XFS (sda1):     type     = 0x9                                                  # XLOG_REG_TYPE_ILOCAL
[   29.744873] XFS (sda1):     len      = 32
[   29.749184] XFS (sda1):     first 32 bytes of iovec[2]:
[   29.755336] 00000000: 02 00 30 e7 02 26 06 00 40 73 65 72 76 65 72 00  ..0..&..@server.
[   29.764612] 00000010: 08 92 ef 04 00 58 6d 78 36 34 00 d3 93 58 00 58  .....Xmx64...X.X

[   29.773900] XFS (sda1): XXX new log item:
[   29.778718] XFS (sda1): log item:
[   29.782856] XFS (sda1):   type       = 0x123b
[   29.787478] XFS (sda1):   flags      = 0x8
[   29.791902] XFS (sda1):   niovecs    = 3
[   29.796321] XFS (sda1):   size       = 312
[   29.800640] XFS (sda1):   bytes      = 172
[   29.805052] XFS (sda1):   buf len    = 176
[   29.809659] XFS (sda1):   iovec[0]
[   29.813781] XFS (sda1):     type     = 0x5
[   29.818289] XFS (sda1):     len      = 56
[   29.822599] XFS (sda1):     first 32 bytes of iovec[0]:
[   29.828754] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
[   29.838024] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   29.847292] XFS (sda1):   iovec[1]
[   29.851420] XFS (sda1):     type     = 0x6
[   29.855933] XFS (sda1):     len      = 96
[   29.860247] XFS (sda1):     first 32 bytes of iovec[1]:
[   29.866406] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
[   29.875677] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
[   29.884949] XFS (sda1):   iovec[2]
[   29.889081] XFS (sda1):     type     = 0x9
[   29.893601] XFS (sda1):     len      = 20
[   29.897924] XFS (sda1):     first 20 bytes of iovec[2]:
[   29.904096] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
[   29.913381] 00000010: 58 92 ef 04                                      X...

- (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 20 to 8 bytes

[   29.982907] XFS (sda1): XXX required buf size 172 -> 160
[   29.992716] XFS (sda1): XXX niovecs           3 -> 3

[   29.998728] XFS (sda1): XXX old log item:
[   30.003624] XFS (sda1): log item:
[   30.007835] XFS (sda1):   type       = 0x123b
[   30.012654] XFS (sda1):   flags      = 0x8
[   30.017145] XFS (sda1):   niovecs    = 3
[   30.021638] XFS (sda1):   size       = 312
[   30.026012] XFS (sda1):   bytes      = 172
[   30.030610] XFS (sda1):   buf len    = 176
[   30.035292] XFS (sda1):   iovec[0]
[   30.039480] XFS (sda1):     type     = 0x5
[   30.044054] XFS (sda1):     len      = 56
[   30.048534] XFS (sda1):     first 32 bytes of iovec[0]:
[   30.054749] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
[   30.064091] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   30.073545] XFS (sda1):   iovec[1]
[   30.077744] XFS (sda1):     type     = 0x6
[   30.082455] XFS (sda1):     len      = 96
[   30.086824] XFS (sda1):     first 32 bytes of iovec[1]:
[   30.093025] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
[   30.102346] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
[   30.111801] XFS (sda1):   iovec[2]
[   30.115989] XFS (sda1):     type     = 0x9
[   30.120715] XFS (sda1):     len      = 20
[   30.125102] XFS (sda1):     first 20 bytes of iovec[2]:
[   30.131331] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
[   30.140808] 00000010: 58 92 ef 04                                      X...

[   30.149006] XFS (sda1): XXX new log item:
[   30.154039] XFS (sda1): log item:
[   30.154039] XFS (sda1):   type       = 0x123b
[   30.154041] XFS (sda1):   flags      = 0x8
[   30.167436] XFS (sda1):   niovecs    = 3
[   30.167437] XFS (sda1):   size       = 312
[   30.167438] XFS (sda1):   bytes      = 160
[   30.180881] XFS (sda1):   buf len    = 160
[   30.180882] XFS (sda1):   iovec[0]
[   30.180882] XFS (sda1):     type     = 0x5
[   30.180883] XFS (sda1):     len      = 56
[   30.180884] XFS (sda1):     first 32 bytes of iovec[0]:
[   30.180884] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
[   30.180885] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   30.180886] XFS (sda1):   iovec[1]
[   30.180886] XFS (sda1):     type     = 0x6
[   30.180887] XFS (sda1):     len      = 96
[   30.180887] XFS (sda1):     first 32 bytes of iovec[1]:
[   30.180888] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
[   30.180889] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
[   30.180889] XFS (sda1):   iovec[2]
[   30.180890] XFS (sda1):     type     = 0x9
[   30.180890] XFS (sda1):     len      = 8
[   30.180890] XFS (sda1):     first 8 bytes of iovec[2]:
[   30.180891] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..

- (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL removed - niovecs from 3 to 2

[   30.197403] XFS (sda1): XXX required buf size 160 -> 152
[   30.296091] XFS (sda1): XXX niovecs           3 -> 2

[   30.296092] XFS (sda1): XXX old log item:
[   30.296093] XFS (sda1): log item:
[   30.297552] ixgbe 0000:01:00.1 net03: renamed from eth3
[   30.317524] XFS (sda1):   type       = 0x123b
[   30.317524] XFS (sda1):   flags      = 0x8
[   30.317525] XFS (sda1):   niovecs    = 3
[   30.317525] XFS (sda1):   size       = 304
[   30.317526] XFS (sda1):   bytes      = 160
[   30.317526] XFS (sda1):   buf len    = 160
[   30.317527] XFS (sda1):   iovec[0]
[   30.317527] XFS (sda1):     type     = 0x5
[   30.317528] XFS (sda1):     len      = 56
[   30.317528] XFS (sda1):     first 32 bytes of iovec[0]:
[   30.317529] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
[   30.317530] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   30.317531] XFS (sda1):   iovec[1]
[   30.317531] XFS (sda1):     type     = 0x6
[   30.317531] XFS (sda1):     len      = 96
[   30.317532] XFS (sda1):     first 32 bytes of iovec[1]:
[   30.317533] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
[   30.317533] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
[   30.317533] XFS (sda1):   iovec[2]
[   30.317534] XFS (sda1):     type     = 0x9
[   30.317534] XFS (sda1):     len      = 8
[   30.317535] XFS (sda1):     first 8 bytes of iovec[2]:
[   30.317535] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..

[   30.317536] XFS (sda1): XXX new log item:
[   30.317537] XFS (sda1): log item:
[   30.317537] XFS (sda1):   type       = 0x123b
[   30.317538] XFS (sda1):   flags      = 0x8
[   30.317539] XFS (sda1):   niovecs    = 2
[   30.317539] XFS (sda1):   size       = 304
[   30.317540] XFS (sda1):   bytes      = 152
[   30.317540] XFS (sda1):   buf len    = 152
[   30.317541] XFS (sda1):   iovec[0]
[   30.317541] XFS (sda1):     type     = 0x5
[   30.317542] XFS (sda1):     len      = 56
[   30.317542] XFS (sda1):     first 32 bytes of iovec[0]:
[   30.317543] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
[   30.317543] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   30.317544] XFS (sda1):   iovec[1]
[   30.317544] XFS (sda1):     type     = 0x6
[   30.317545] XFS (sda1):     len      = 96
[   30.317545] XFS (sda1):     first 32 bytes of iovec[1]:
[   30.317546] 00000000: 4e 49 00 00 02 02 00 00 00 00 00 00 00 00 00 00  NI..............
[   30.317546] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................

- XFS_LI_INODE.XLOG_REG_TYPE_IEXT removed - niovecs from 3 to 2

[   37.983756] XFS (sda1): XXX required buf size 168 -> 152
[   37.990253] XFS (sda1): XXX niovecs           3 -> 2

[   37.996202] XFS (sda1): XXX old log item:
[   38.001061] XFS (sda1): log item:
[   38.005239] XFS (sda1):   type       = 0x123b
[   38.009885] XFS (sda1):   flags      = 0x9
[   38.014330] XFS (sda1):   niovecs    = 3
[   38.018764] XFS (sda1):   size       = 440
[   38.023100] XFS (sda1):   bytes      = 168
[   38.027533] XFS (sda1):   buf len    = 168
[   38.032157] XFS (sda1):   iovec[0]
[   38.036286] XFS (sda1):     type     = 0x5
[   38.040796] XFS (sda1):     len      = 56
[   38.045114] XFS (sda1):     first 32 bytes of iovec[0]:
[   38.051277] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
[   38.060562] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   38.069853] XFS (sda1):   iovec[1]
[   38.073989] XFS (sda1):     type     = 0x6
[   38.078525] XFS (sda1):     len      = 96
[   38.082871] XFS (sda1):     first 32 bytes of iovec[1]:
[   38.089052] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
[   38.098331] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.
[   38.107611] XFS (sda1):   iovec[2]
[   38.111754] XFS (sda1):     type     = 0x7                                                # XLOG_REG_TYPE_IEXT
[   38.116285] XFS (sda1):     len      = 16
[   38.120608] XFS (sda1):     first 16 bytes of iovec[2]:
[   38.126770] 00000000: 00 00 00 00 00 00 00 00 00 00 00 11 31 80 00 01  ............1...

[   38.136054] XFS (sda1): XXX new log item:
[   38.140878] XFS (sda1): log item:
[   38.145025] XFS (sda1):   type       = 0x123b
[   38.149645] XFS (sda1):   flags      = 0x9
[   38.154067] XFS (sda1):   niovecs    = 2
[   38.158490] XFS (sda1):   size       = 440
[   38.162799] XFS (sda1):   bytes      = 152
[   38.167202] XFS (sda1):   buf len    = 152
[   38.171801] XFS (sda1):   iovec[0]
[   38.175911] XFS (sda1):     type     = 0x5
[   38.180409] XFS (sda1):     len      = 56
[   38.184708] XFS (sda1):     first 32 bytes of iovec[0]:
[   38.190852] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
[   38.200115] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   38.209373] XFS (sda1):   iovec[1]
[   38.213488] XFS (sda1):     type     = 0x6
[   38.217990] XFS (sda1):     len      = 96
[   38.222297] XFS (sda1):     first 32 bytes of iovec[1]:
[   38.228442] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
[   38.237707] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.

- XFS_LI_BUF.XLOG_REG_TYPE_BCHUNK removed - niovecs from 3 to 2

[ 2120.030590] XFS (md0): XXX required buf size 3992 -> 4120
[ 2120.037257] XFS (md0): XXX niovecs           3 -> 2

[ 2120.043295] XFS (md0): XXX old log item:
[ 2120.048235] XFS (md0): log item:
[ 2120.052394] XFS (md0):   type        = 0x123c                                             # XFS_LI_BUF
[ 2120.057112] XFS (md0):   flags       = 0x8
[ 2120.061624] XFS (md0):   niovecs     = 3
[ 2120.066029] XFS (md0):   size        = 4120
[ 2120.070528] XFS (md0):   bytes       = 3992
[ 2120.075120] XFS (md0):   buf len     = 3992
[ 2120.079990] XFS (md0):   iovec[0]
[ 2120.084200] XFS (md0):     type      = 0x1                                                # XLOG_REG_TYPE_BFORMAT
[ 2120.088891] XFS (md0):     len       = 24
[ 2120.093287] XFS (md0):     first 24 bytes of iovec[0]:
[ 2120.099622] 00000000: 3c 12 03 00 00 50 08 00 78 dc ff 7f 04 00 00 00  <....P..x.......
[ 2120.109088] 00000010: 01 00 00 00 ff ff ff fe                          ........
[ 2120.117878] XFS (md0):   iovec[1]
[ 2120.122012] XFS (md0):     type      = 0x2                                                # XLOG_REG_TYPE_BCHUNK
[ 2120.126532] XFS (md0):     len       = 3072
[ 2120.131130] XFS (md0):     first 32 bytes of iovec[1]:
[ 2120.137376] 00000000: 58 44 42 33 00 00 00 00 00 00 00 04 7f ff dc 78  XDB3...........x
[ 2120.146740] 00000010: 00 00 00 00 00 00 00 00 58 03 ee 68 c3 6b 48 0c  ........X..h.kH.
[ 2120.156111] XFS (md0):   iovec[2]
[ 2120.160339] XFS (md0):     type      = 0x2                                                # XLOG_REG_TYPE_BCHUNK
[ 2120.164860] XFS (md0):     len       = 896
[ 2120.169354] XFS (md0):     first 32 bytes of iovec[2]:
[ 2120.175509] 00000000: 00 00 00 00 00 00 0b 80 00 00 00 2e 00 00 00 08  ................
[ 2120.184963] 00000010: 00 00 17 2e 00 00 00 0a 04 7c 79 34 00 00 00 16  .........|y4....

[ 2120.194426] XFS (md0): XXX new log item:
[ 2120.199236] XFS (md0): log item:
[ 2120.203459] XFS (md0):   type        = 0x123c
[ 2120.208064] XFS (md0):   flags       = 0x8
[ 2120.212465] XFS (md0):   niovecs     = 2
[ 2120.216949] XFS (md0):   size        = 4224
[ 2120.221434] XFS (md0):   bytes       = 4120
[ 2120.225921] XFS (md0):   buf len     = 4120
[ 2120.230695] XFS (md0):   iovec[0]
[ 2120.234780] XFS (md0):     type      = 0x1
[ 2120.239256] XFS (md0):     len       = 24
[ 2120.243619] XFS (md0):     first 24 bytes of iovec[0]:
[ 2120.249735] 00000000: 3c 12 02 00 00 50 08 00 78 dc ff 7f 04 00 00 00  <....P..x.......
[ 2120.259148] 00000010: 01 00 00 00 ff ff ff ff                          ........
[ 2120.267776] XFS (md0):   iovec[1]
[ 2120.271845] XFS (md0):     type      = 0x2
[ 2120.276287] XFS (md0):     len       = 4096
[ 2120.280803] XFS (md0):     first 32 bytes of iovec[1]:
[ 2120.286884] 00000000: 58 44 42 33 00 00 00 00 00 00 00 04 7f ff dc 78  XDB3...........x
[ 2120.296254] 00000010: 00 00 00 00 00 00 00 00 58 03 ee 68 c3 6b 48 0c  ........X..h.kH.

- only XFS_LI_BUF.XLOG_REG_TYPE_BCHUNK removed - niovecs from 2 to 1

[ 2279.729095] XFS (sda1): XXX required buf size 152 -> 24
[ 2279.735437] XFS (sda1): XXX niovecs           2 -> 1

[ 2279.741360] XFS (sda1): XXX old log item:
[ 2279.746199] XFS (sda1): log item:
[ 2279.750512] XFS (sda1):   type       = 0x123c
[ 2279.755181] XFS (sda1):   flags      = 0x8
[ 2279.759644] XFS (sda1):   niovecs    = 2
[ 2279.764246] XFS (sda1):   size       = 256
[ 2279.768625] XFS (sda1):   bytes      = 152
[ 2279.773094] XFS (sda1):   buf len    = 152
[ 2279.777741] XFS (sda1):   iovec[0]
[ 2279.782044] XFS (sda1):     type     = 0x1
[ 2279.786607] XFS (sda1):     len      = 24
[ 2279.790980] XFS (sda1):     first 24 bytes of iovec[0]:
[ 2279.797290] 00000000: 3c 12 02 00 00 20 08 00 b0 41 f8 08 00 00 00 00  <.... ...A......
[ 2279.806730] 00000010: 01 00 00 00 01 00 00 00                          ........
[ 2279.815287] XFS (sda1):   iovec[1]
[ 2279.819604] XFS (sda1):     type     = 0x2
[ 2279.824408] XFS (sda1):     len      = 128
[ 2279.828888] XFS (sda1):     first 32 bytes of iovec[1]:
[ 2279.835126] 00000000: 42 4d 41 50 00 00 00 09 ff ff ff ff ff ff ff ff  BMAP............
[ 2279.844617] 00000010: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00  ................

[ 2279.853964] XFS (sda1): XXX new log item:
[ 2279.858958] XFS (sda1): log item:
[ 2279.863149] XFS (sda1):   type       = 0x123c
[ 2279.867834] XFS (sda1):   flags      = 0x8
[ 2279.872410] XFS (sda1):   niovecs    = 1
[ 2279.876874] XFS (sda1):   size       = 256
[ 2279.881242] XFS (sda1):   bytes      = 24
[ 2279.885585] XFS (sda1):   buf len    = 24
[ 2279.890248] XFS (sda1):   iovec[0]
[ 2279.894408] XFS (sda1):     type     = 0x1
[ 2279.898954] XFS (sda1):     len      = 24
[ 2279.903291] XFS (sda1):     first 24 bytes of iovec[0]:
[ 2279.909608] 00000000: 3c 12 01 00 02 00 08 00 b0 41 f8 08 00 00 00 00  <........A......
[ 2279.918937] 00000010: 01 00 00 00 00 00 00 00                          ........

Log code for reference

commit 137ce3815d4d3ec47a2cf043a9c26aff1892eabe
Author: Donald Buczek <buczek@molgen.mpg.de>
Date:   Fri Jan 1 15:36:30 2021 +0100

     log negativ log items

diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 58c3fcbec94a..d2e6c7d78047 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -17,6 +17,7 @@ struct xfs_log_vec {
  	int			lv_bytes;	/* accounted space in buffer */
  	int			lv_buf_len;	/* aligned size of buffer */
  	int			lv_size;	/* size of allocated lv */
+	int			lv_debug_required_bytes;
  };
  
  #define XFS_LOG_VEC_ORDERED	(-1)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index d620de8e217c..3931ae9c8996 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -17,6 +17,8 @@
  #include "xfs_log_priv.h"
  #include "xfs_trace.h"
  
+#include <linux/ratelimit_types.h>
+
  struct workqueue_struct *xfs_discard_wq;
  
  /*
@@ -133,6 +135,7 @@ xlog_cil_alloc_shadow_bufs(
  		int	nbytes = 0;
  		int	buf_size;
  		bool	ordered = false;
+		int	debug_required_bytes;
  
  		/* Skip items which aren't dirty in this transaction. */
  		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
@@ -141,6 +144,8 @@ xlog_cil_alloc_shadow_bufs(
  		/* get number of vecs and size of data to be stored */
  		lip->li_ops->iop_size(lip, &niovecs, &nbytes);
  
+		debug_required_bytes = nbytes;
+
  		/*
  		 * Ordered items need to be tracked but we do not wish to write
  		 * them. We need a logvec to track the object, but we do not
@@ -208,6 +213,7 @@ xlog_cil_alloc_shadow_bufs(
  
  		/* Ensure the lv is set up according to ->iop_size */
  		lv->lv_niovecs = niovecs;
+		lv->lv_debug_required_bytes = debug_required_bytes;
  
  		/* The allocated data region lies beyond the iovec region */
  		lv->lv_buf = (char *)lv + xlog_cil_iovec_space(niovecs);
@@ -266,6 +272,39 @@ xfs_cil_prepare_item(
  		lv->lv_item->li_seq = log->l_cilp->xc_ctx->sequence;
  }
  
+static void xlog_print_logitem(
+	struct xfs_log_item *lip) {
+
+	struct xfs_mount	*mp = lip->li_mountp;
+	struct xfs_log_vec	*lv = lip->li_lv;
+	struct xfs_log_iovec	*vec;
+	int			i;
+
+	xfs_warn(mp, "log item: ");
+	xfs_warn(mp, "  type	= 0x%x", lip->li_type);
+	xfs_warn(mp, "  flags	= 0x%lx", lip->li_flags);
+	if (!lv)
+		return;
+	xfs_warn(mp, "  niovecs	= %d", lv->lv_niovecs);
+	xfs_warn(mp, "  size	= %d", lv->lv_size);
+	xfs_warn(mp, "  bytes	= %d", lv->lv_bytes);
+	xfs_warn(mp, "  buf len	= %d", lv->lv_buf_len);
+
+	/* dump each iovec for the log item */
+	vec = lv->lv_iovecp;
+	for (i = 0; i < lv->lv_niovecs; i++) {
+		int dumplen = min(vec->i_len, 32);
+
+		xfs_warn(mp, "  iovec[%d]", i);
+		xfs_warn(mp, "    type	= 0x%x", vec->i_type);
+		xfs_warn(mp, "    len	= %d", vec->i_len);
+		xfs_warn(mp, "    first %d bytes of iovec[%d]:", dumplen, i);
+		xfs_hex_dump(vec->i_addr, dumplen);
+
+		vec++;
+	}
+}
+
  /*
   * Format log item into a flat buffers
   *
@@ -315,6 +354,8 @@ xlog_cil_insert_format_items(
  		struct xfs_log_vec *old_lv = NULL;
  		struct xfs_log_vec *shadow;
  		bool	ordered = false;
+		bool	want_dump = false;
+		static DEFINE_RATELIMIT_STATE(_rs, 1*HZ, 5);
  
  		/* Skip items which aren't dirty in this transaction. */
  		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
@@ -334,6 +375,21 @@ xlog_cil_insert_format_items(
  
  		/* compare to existing item size */
  		old_lv = lip->li_lv;
+
+		if (lip->li_lv
+			&& (
+				shadow->lv_debug_required_bytes < old_lv->lv_bytes
+				|| shadow->lv_niovecs < old_lv->lv_niovecs )
+			&& __ratelimit(&_rs) ) {
+			xfs_warn(lip->li_mountp, "XXX required buf size %d -> %d",
+				old_lv->lv_bytes, shadow->lv_debug_required_bytes);
+			xfs_warn(lip->li_mountp, "XXX niovecs           %d -> %d",
+				old_lv->lv_niovecs, shadow->lv_niovecs);
+			xfs_warn(lip->li_mountp, "XXX old log item:");
+			xlog_print_logitem(lip);
+			want_dump = true;
+		}
+
  		if (lip->li_lv && shadow->lv_size <= lip->li_lv->lv_size) {
  			/* same or smaller, optimise common overwrite case */
  			lv = lip->li_lv;
@@ -372,6 +428,10 @@ xlog_cil_insert_format_items(
  		lip->li_ops->iop_format(lip, lv);
  insert:
  		xfs_cil_prepare_item(log, lv, old_lv, diff_len, diff_iovecs);
+		if (want_dump) {
+			xfs_warn(lip->li_mountp, "XXX new log item:");
+			xlog_print_logitem(lip);
+		}
  	}
  }

Dave Chinner Jan. 2, 2021, 10:44 p.m. UTC | #5

On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
> On 31.12.20 22:59, Dave Chinner wrote:
> > On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> > > On 30.12.20 23:16, Dave Chinner wrote:

> > One could argue that, but one should also understand the design
> > constraints for a particular algorithm are before suggesting
> > that their solution is "robust". :)
>
> Yes, but an understanding to the extend required by the
> argument should be sufficient :-)

And that's the fundamental conceit described by Dunning-Kruger.

I.e. someone thinks they know enough to understand the argument,
when in fact they don't understand enough about the subject matter
to realise they don't understand what the expert at the other end is
saying at all....

> > > # seq 29
> > > 
> > > 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
> > 
> > So, at 20:08:15 we get a push trigger and the work is queued. But...
> > 
> > .....
> > > 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
> > 
> > It takes the best part of *50 seconds* before the push work actually
> > runs?
> > 
> > That's .... well and truly screwed up - the work should run on that
> > CPU on the very next time it yeilds the CPU. If we're holding the
> > CPU without yeilding it for that long, hangcheck and RCU warnings
> > should be going off...
> 
> No such warnings.
> 
> But the load is probably I/O bound to the log:
> 
> - creates `cp -a` copies of a directory tree with small files (linux source repository)
> - source tree probably completely cached.
> - two copies in parallel
> - slow log (on software raid6)
> 
> Isn't it to be expected that sooner or later you need to wait for
> log writes when you write as fast as possible with lots of
> metadata updates and not so much data?

No, incoming transactions waits for transaction reservation space,
not log writes. Reservation space is freed up by metadata writeback.
So if you have new transactions blocking in xfs_trans_reserve(),
then we are blocking on metadata writeback.

The CIL worker thread may be waiting for log IO completion before it
issues more log IO, and in that case it is waiting on iclog buffer
space (i.e. only waiting on internal log state, not metadata
writeback).  Can you find that kworker thread stack and paste it? If
bound on log IO, it will be blocked in xlog_get_iclog_space().

Please paste the entire stack output, too, not just the bits you
think are relevant or understand....

Also, cp -a of a linux source tree is just as data intensive as it
is metadata intensive. There's probably more data IO than metadata
IO, so that's more likely to be what is slowing the disks down as
metadata writeback is...

> I'm a bit concerned, though, that there seem to be a rather
> unlimited (~ 1000) number of kernel worker threads waiting for the
> cil push and indirectly for log writes.

That's normal - XFS is highly asynchronous and defers a lot of work
to completion threads.

> > So it dropped by 16 bytes (seems to be common) which is
> > unexpected.  I wonder if it filled a hole in a buffer and so
> > needed one less xlog_op_header()? But then the size would have
> > gone up by at least 128 bytes for the hole that was filled, so
> > it still shouldn't go down in size.
> > 
> > I think you need to instrument xlog_cil_insert_items() and catch
> > a negative length here:
> > 
> > 	/* account for space used by new iovec headers  */
> > 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t); len +=
> > 	iovhdr_res; ctx->nvecs += diff_iovecs;
> > 
> > (diff_iovecs will be negative if the number of xlog_op_header
> > structures goes down)
> > 
> > And if this happens, then dump the transaction ticket via
> > xlog_print_trans(tp) so we can see all the log items types and
> > vectors that the transaction has formatted...
> 
> I tried that, but the output was difficult to understand, because
> at that point you can only log the complete transaction with the
> items already updated.  And a shrinking item is not switched to the
> shadow vector, so the formatted content is already overwritten and
> not available for analysis.

Yes, that's exactly the information I need to see.

But the fact you think this is something I don't need to know about
is classic Dunning-Kruger in action. You don't understand why I
asked for this information, and found the information "difficult to
understand", so you decided I didn't need it either, despite the
fact I asked explicitly for it.

What I first need to know is what operations are being performed by
the transaciton that shrunk and what all the items in it are, not
which specific items shrunk and by how much. There can be tens to
hundreds of items in a single transaction, and it's the combination
of the transaction state, the reservation, the amount of the
reservation that has been consumed, what items are consuming that
reservation, etc. that I need to first see and analyse.

I don't ask for specific information just for fun - I ask for
specific information because it is *necessary*. If you want the
problem triaged and fixed, then please supply the information you
are asked for, even if you don't understand why or what it means.

> So I've added some code to dump an item in its old and its new
> state in xlog_cil_insert_format_items when either the requested
> buffer len or the number of vectors decreased.

Yes, that would be useful when trying to understand the low level
details of where a specific operation might be doing something
unexpected, but I don't have a high level picture of what operations
are triggering this issue so have no clue about the context in which
these isolated, context free changes are occuring.

That said ....

> Several examples of different kind with a little bit of manual annotation following
> 
> Best
>   Donald
> 
> - XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 32 to 20 bytes
> 
> [   29.606212] XFS (sda1): XXX required buf size 184 -> 172
> [   29.612591] XFS (sda1): XXX niovecs           3 -> 3
> 
> [   29.618570] XFS (sda1): XXX old log item:
> [   29.623469] XFS (sda1): log item:
> [   29.627683] XFS (sda1):   type       = 0x123b                                               # XFS_LI_INODE
> [   29.632375] XFS (sda1):   flags      = 0x8
> [   29.636858] XFS (sda1):   niovecs    = 3
> [   29.647442] XFS (sda1):   size       = 312
> [   29.651814] XFS (sda1):   bytes      = 184
> [   29.656278] XFS (sda1):   buf len    = 184
> [   29.660927] XFS (sda1):   iovec[0]
> [   29.665071] XFS (sda1):     type     = 0x5                                                  # XLOG_REG_TYPE_IFORMAT
> [   29.669592] XFS (sda1):     len      = 56
> [   29.673914] XFS (sda1):     first 32 bytes of iovec[0]:
> [   29.680079] 00000000: 3b 12 03 00 03 00 00 00 00 00 20 00 00 00 00 00  ;......... .....
> [   29.689363] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   29.698633] XFS (sda1):   iovec[1]
> [   29.702756] XFS (sda1):     type     = 0x6                                                  # XLOG_REG_TYPE_ICORE
> [   29.707263] XFS (sda1):     len      = 96
> [   29.711571] XFS (sda1):     first 32 bytes of iovec[1]:
> [   29.717720] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
> [   29.726986] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> [   29.736241] XFS (sda1):   iovec[2]
> [   29.740364] XFS (sda1):     type     = 0x9                                                  # XLOG_REG_TYPE_ILOCAL
> [   29.744873] XFS (sda1):     len      = 32
> [   29.749184] XFS (sda1):     first 32 bytes of iovec[2]:
> [   29.755336] 00000000: 02 00 30 e7 02 26 06 00 40 73 65 72 76 65 72 00  ..0..&..@server.
> [   29.764612] 00000010: 08 92 ef 04 00 58 6d 78 36 34 00 d3 93 58 00 58  .....Xmx64...X.X
> 
> [   29.773900] XFS (sda1): XXX new log item:
> [   29.778718] XFS (sda1): log item:
> [   29.782856] XFS (sda1):   type       = 0x123b
> [   29.787478] XFS (sda1):   flags      = 0x8
> [   29.791902] XFS (sda1):   niovecs    = 3
> [   29.796321] XFS (sda1):   size       = 312
> [   29.800640] XFS (sda1):   bytes      = 172
> [   29.805052] XFS (sda1):   buf len    = 176
> [   29.809659] XFS (sda1):   iovec[0]
> [   29.813781] XFS (sda1):     type     = 0x5
> [   29.818289] XFS (sda1):     len      = 56
> [   29.822599] XFS (sda1):     first 32 bytes of iovec[0]:
> [   29.828754] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
> [   29.838024] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   29.847292] XFS (sda1):   iovec[1]
> [   29.851420] XFS (sda1):     type     = 0x6
> [   29.855933] XFS (sda1):     len      = 96
> [   29.860247] XFS (sda1):     first 32 bytes of iovec[1]:
> [   29.866406] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
> [   29.875677] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> [   29.884949] XFS (sda1):   iovec[2]
> [   29.889081] XFS (sda1):     type     = 0x9
> [   29.893601] XFS (sda1):     len      = 20
> [   29.897924] XFS (sda1):     first 20 bytes of iovec[2]:
> [   29.904096] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
> [   29.913381] 00000010: 58 92 ef 04                                      X...

That may be removal of a directory entry from a local form
directory. Hmmm - local form inode forks only log the active byte
range so that could be the issue here.

Hold on - where did this come from? "cp -a" doesn't result in file
and directory removals, right?

> - (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 20 to 8 bytes
> 
> [   29.982907] XFS (sda1): XXX required buf size 172 -> 160
> [   29.992716] XFS (sda1): XXX niovecs           3 -> 3
> 
> [   29.998728] XFS (sda1): XXX old log item:
> [   30.003624] XFS (sda1): log item:
> [   30.007835] XFS (sda1):   type       = 0x123b
> [   30.012654] XFS (sda1):   flags      = 0x8
> [   30.017145] XFS (sda1):   niovecs    = 3
> [   30.021638] XFS (sda1):   size       = 312
> [   30.026012] XFS (sda1):   bytes      = 172
> [   30.030610] XFS (sda1):   buf len    = 176
> [   30.035292] XFS (sda1):   iovec[0]
> [   30.039480] XFS (sda1):     type     = 0x5
> [   30.044054] XFS (sda1):     len      = 56
> [   30.048534] XFS (sda1):     first 32 bytes of iovec[0]:
> [   30.054749] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
> [   30.064091] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   30.073545] XFS (sda1):   iovec[1]
> [   30.077744] XFS (sda1):     type     = 0x6
> [   30.082455] XFS (sda1):     len      = 96
> [   30.086824] XFS (sda1):     first 32 bytes of iovec[1]:
> [   30.093025] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
> [   30.102346] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> [   30.111801] XFS (sda1):   iovec[2]
> [   30.115989] XFS (sda1):     type     = 0x9
> [   30.120715] XFS (sda1):     len      = 20
> [   30.125102] XFS (sda1):     first 20 bytes of iovec[2]:
> [   30.131331] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
> [   30.140808] 00000010: 58 92 ef 04                                      X...
> 
> [   30.149006] XFS (sda1): XXX new log item:
> [   30.154039] XFS (sda1): log item:
> [   30.154039] XFS (sda1):   type       = 0x123b
> [   30.154041] XFS (sda1):   flags      = 0x8
> [   30.167436] XFS (sda1):   niovecs    = 3
> [   30.167437] XFS (sda1):   size       = 312
> [   30.167438] XFS (sda1):   bytes      = 160
> [   30.180881] XFS (sda1):   buf len    = 160
> [   30.180882] XFS (sda1):   iovec[0]
> [   30.180882] XFS (sda1):     type     = 0x5
> [   30.180883] XFS (sda1):     len      = 56
> [   30.180884] XFS (sda1):     first 32 bytes of iovec[0]:
> [   30.180884] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
> [   30.180885] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   30.180886] XFS (sda1):   iovec[1]
> [   30.180886] XFS (sda1):     type     = 0x6
> [   30.180887] XFS (sda1):     len      = 96
> [   30.180887] XFS (sda1):     first 32 bytes of iovec[1]:
> [   30.180888] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
> [   30.180889] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> [   30.180889] XFS (sda1):   iovec[2]
> [   30.180890] XFS (sda1):     type     = 0x9
> [   30.180890] XFS (sda1):     len      = 8
> [   30.180890] XFS (sda1):     first 8 bytes of iovec[2]:
> [   30.180891] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..

And that looks to be another unlink in the same directory inode, now
empty.

> 
> - (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL removed - niovecs from 3 to 2
> 
> [   30.197403] XFS (sda1): XXX required buf size 160 -> 152
> [   30.296091] XFS (sda1): XXX niovecs           3 -> 2
> 
> [   30.296092] XFS (sda1): XXX old log item:
> [   30.296093] XFS (sda1): log item:
> [   30.297552] ixgbe 0000:01:00.1 net03: renamed from eth3
> [   30.317524] XFS (sda1):   type       = 0x123b
> [   30.317524] XFS (sda1):   flags      = 0x8
> [   30.317525] XFS (sda1):   niovecs    = 3
> [   30.317525] XFS (sda1):   size       = 304
> [   30.317526] XFS (sda1):   bytes      = 160
> [   30.317526] XFS (sda1):   buf len    = 160
> [   30.317527] XFS (sda1):   iovec[0]
> [   30.317527] XFS (sda1):     type     = 0x5
> [   30.317528] XFS (sda1):     len      = 56
> [   30.317528] XFS (sda1):     first 32 bytes of iovec[0]:
> [   30.317529] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
> [   30.317530] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   30.317531] XFS (sda1):   iovec[1]
> [   30.317531] XFS (sda1):     type     = 0x6
> [   30.317531] XFS (sda1):     len      = 96
> [   30.317532] XFS (sda1):     first 32 bytes of iovec[1]:
> [   30.317533] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
> [   30.317533] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> [   30.317533] XFS (sda1):   iovec[2]
> [   30.317534] XFS (sda1):     type     = 0x9
> [   30.317534] XFS (sda1):     len      = 8
> [   30.317535] XFS (sda1):     first 8 bytes of iovec[2]:
> [   30.317535] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..
> 
> [   30.317536] XFS (sda1): XXX new log item:
> [   30.317537] XFS (sda1): log item:
> [   30.317537] XFS (sda1):   type       = 0x123b
> [   30.317538] XFS (sda1):   flags      = 0x8
> [   30.317539] XFS (sda1):   niovecs    = 2
> [   30.317539] XFS (sda1):   size       = 304
> [   30.317540] XFS (sda1):   bytes      = 152
> [   30.317540] XFS (sda1):   buf len    = 152
> [   30.317541] XFS (sda1):   iovec[0]
> [   30.317541] XFS (sda1):     type     = 0x5
> [   30.317542] XFS (sda1):     len      = 56
> [   30.317542] XFS (sda1):     first 32 bytes of iovec[0]:
> [   30.317543] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
> [   30.317543] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   30.317544] XFS (sda1):   iovec[1]
> [   30.317544] XFS (sda1):     type     = 0x6
> [   30.317545] XFS (sda1):     len      = 96
> [   30.317545] XFS (sda1):     first 32 bytes of iovec[1]:
> [   30.317546] 00000000: 4e 49 00 00 02 02 00 00 00 00 00 00 00 00 00 00  NI..............
> [   30.317546] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................

And maybe that's an rmdir() triggering an unlink on the same - now
empty - directory.

Does your workload also do 'rm -rf' anywhere because I can't explain
how a concurrent 'cp -a' workload would trigger this behaviour...

> - XFS_LI_INODE.XLOG_REG_TYPE_IEXT removed - niovecs from 3 to 2
> 
> [   37.983756] XFS (sda1): XXX required buf size 168 -> 152
> [   37.990253] XFS (sda1): XXX niovecs           3 -> 2
> 
> [   37.996202] XFS (sda1): XXX old log item:
> [   38.001061] XFS (sda1): log item:
> [   38.005239] XFS (sda1):   type       = 0x123b
> [   38.009885] XFS (sda1):   flags      = 0x9
> [   38.014330] XFS (sda1):   niovecs    = 3
> [   38.018764] XFS (sda1):   size       = 440
> [   38.023100] XFS (sda1):   bytes      = 168
> [   38.027533] XFS (sda1):   buf len    = 168
> [   38.032157] XFS (sda1):   iovec[0]
> [   38.036286] XFS (sda1):     type     = 0x5
> [   38.040796] XFS (sda1):     len      = 56
> [   38.045114] XFS (sda1):     first 32 bytes of iovec[0]:
> [   38.051277] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
> [   38.060562] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> [   38.069853] XFS (sda1):   iovec[1]
> [   38.073989] XFS (sda1):     type     = 0x6
> [   38.078525] XFS (sda1):     len      = 96
> [   38.082871] XFS (sda1):     first 32 bytes of iovec[1]:
> [   38.089052] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
> [   38.098331] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.
> [   38.107611] XFS (sda1):   iovec[2]
> [   38.111754] XFS (sda1):     type     = 0x7                                                # XLOG_REG_TYPE_IEXT
> [   38.116285] XFS (sda1):     len      = 16
> [   38.120608] XFS (sda1):     first 16 bytes of iovec[2]:
> [   38.126770] 00000000: 00 00 00 00 00 00 00 00 00 00 00 11 31 80 00 01  ............1...
> 
> [   38.136054] XFS (sda1): XXX new log item:
> [   38.140878] XFS (sda1): log item:
> [   38.145025] XFS (sda1):   type       = 0x123b
> [   38.149645] XFS (sda1):   flags      = 0x9
> [   38.154067] XFS (sda1):   niovecs    = 2
> [   38.158490] XFS (sda1):   size       = 440
> [   38.162799] XFS (sda1):   bytes      = 152
> [   38.167202] XFS (sda1):   buf len    = 152
> [   38.171801] XFS (sda1):   iovec[0]
> [   38.175911] XFS (sda1):     type     = 0x5
> [   38.180409] XFS (sda1):     len      = 56
> [   38.184708] XFS (sda1):     first 32 bytes of iovec[0]:
> [   38.190852] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
> [   38.200115] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> [   38.209373] XFS (sda1):   iovec[1]
> [   38.213488] XFS (sda1):     type     = 0x6
> [   38.217990] XFS (sda1):     len      = 96
> [   38.222297] XFS (sda1):     first 32 bytes of iovec[1]:
> [   38.228442] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
> [   38.237707] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.

truncate() or directory entry removal resulting in block to
shortform conversion, maybe? Implies 'rm -rf', too.

> 
> - XFS_LI_BUF.XLOG_REG_TYPE_BCHUNK removed - niovecs from 3 to 2
> 
> [ 2120.030590] XFS (md0): XXX required buf size 3992 -> 4120
> [ 2120.037257] XFS (md0): XXX niovecs           3 -> 2

Logged buffer size went up because a hole was filled. Not a
candidate. Was a directory buffer.

> 
> [ 2279.729095] XFS (sda1): XXX required buf size 152 -> 24
> [ 2279.735437] XFS (sda1): XXX niovecs           2 -> 1
> 
> [ 2279.741360] XFS (sda1): XXX old log item:
> [ 2279.746199] XFS (sda1): log item:
> [ 2279.750512] XFS (sda1):   type       = 0x123c
> [ 2279.755181] XFS (sda1):   flags      = 0x8
> [ 2279.759644] XFS (sda1):   niovecs    = 2
> [ 2279.764246] XFS (sda1):   size       = 256
> [ 2279.768625] XFS (sda1):   bytes      = 152
> [ 2279.773094] XFS (sda1):   buf len    = 152
> [ 2279.777741] XFS (sda1):   iovec[0]
> [ 2279.782044] XFS (sda1):     type     = 0x1
> [ 2279.786607] XFS (sda1):     len      = 24
> [ 2279.790980] XFS (sda1):     first 24 bytes of iovec[0]:
> [ 2279.797290] 00000000: 3c 12 02 00 00 20 08 00 b0 41 f8 08 00 00 00 00  <.... ...A......
> [ 2279.806730] 00000010: 01 00 00 00 01 00 00 00                          ........
> [ 2279.815287] XFS (sda1):   iovec[1]
> [ 2279.819604] XFS (sda1):     type     = 0x2
> [ 2279.824408] XFS (sda1):     len      = 128
> [ 2279.828888] XFS (sda1):     first 32 bytes of iovec[1]:
> [ 2279.835126] 00000000: 42 4d 41 50 00 00 00 09 ff ff ff ff ff ff ff ff  BMAP............
> [ 2279.844617] 00000010: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00  ................

Hold on - BMAP?

But the directory buffer above was a XDB3, and so a matching BMAP
btree buffer should have a BMA3 signature. So this is output from
two different filesystems, one formatted with crc=0, the other with
crc=1....

Oh, wait, now that I look at it, all the inodes in the output are v2
inodes. Yet your testing is supposed to be running on CRC+reflink
enabled filesystems, which use v3 inodes. So none of these traces
come from the filesystem under your 'cp -a' workload, right?

Fmeh, it's right there - "XFS (sda1)...". These are all traces from
your root directory, not md0/md1 which are your RAID6 devices
running the 'cp -a' workload. The only trace from md0 was the hole
filling buffer which grew the size of the log vector.

Ok, can you plese filter the output to limit it to just the RAID6
filesystem under test so we get the transaction dumps just before it
hangs? If you want to, create a separate dump output from the root
device to capture the things it is doing, but the first thing is to
isolate the 'cp -a' hang vector...

Cheers,

Dave.

Donald Buczek Jan. 3, 2021, 4:03 p.m. UTC | #6

On 02.01.21 23:44, Dave Chinner wrote:
> On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
>> On 31.12.20 22:59, Dave Chinner wrote:
>>> On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
>>>> On 30.12.20 23:16, Dave Chinner wrote:
> 
>>> One could argue that, but one should also understand the design
>>> constraints for a particular algorithm are before suggesting
>>> that their solution is "robust". :)
>>
>> Yes, but an understanding to the extend required by the
>> argument should be sufficient :-)
> 
> And that's the fundamental conceit described by Dunning-Kruger.
> 
> I.e. someone thinks they know enough to understand the argument,
> when in fact they don't understand enough about the subject matter
> to realise they don't understand what the expert at the other end is
> saying at all....
> 
>>>> # seq 29
>>>>
>>>> 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
>>>
>>> So, at 20:08:15 we get a push trigger and the work is queued. But...
>>>
>>> .....
>>>> 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
>>>
>>> It takes the best part of *50 seconds* before the push work actually
>>> runs?
>>>
>>> That's .... well and truly screwed up - the work should run on that
>>> CPU on the very next time it yeilds the CPU. If we're holding the
>>> CPU without yeilding it for that long, hangcheck and RCU warnings
>>> should be going off...
>>
>> No such warnings.
>>
>> But the load is probably I/O bound to the log:
>>
>> - creates `cp -a` copies of a directory tree with small files (linux source repository)
>> - source tree probably completely cached.
>> - two copies in parallel
>> - slow log (on software raid6)
>>
>> Isn't it to be expected that sooner or later you need to wait for
>> log writes when you write as fast as possible with lots of
>> metadata updates and not so much data?
> 
> No, incoming transactions waits for transaction reservation space,
> not log writes. Reservation space is freed up by metadata writeback.
> So if you have new transactions blocking in xfs_trans_reserve(),
> then we are blocking on metadata writeback.
> 
> The CIL worker thread may be waiting for log IO completion before it
> issues more log IO, and in that case it is waiting on iclog buffer
> space (i.e. only waiting on internal log state, not metadata
> writeback).  Can you find that kworker thread stack and paste it? If
> bound on log IO, it will be blocked in xlog_get_iclog_space().
> 
> Please paste the entire stack output, too, not just the bits you
> think are relevant or understand....

That would be a kworker on the "xfs-cil/%s" workqueue, correct?
And you mean before the lockup, when the I/O is still active, correct?

This is the usual stack output from that thread:

# # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 :
# cat /proc/2080/task/2080/stack

[<0>] md_flush_request+0x87/0x190
[<0>] raid5_make_request+0x61b/0xb30
[<0>] md_handle_request+0x127/0x1a0
[<0>] md_submit_bio+0xbd/0x100
[<0>] submit_bio_noacct+0x151/0x410
[<0>] submit_bio+0x4b/0x1a0
[<0>] xlog_state_release_iclog+0x87/0xb0
[<0>] xlog_write+0x452/0x6d0
[<0>] xlog_cil_push_work+0x2e0/0x4d0
[<0>] process_one_work+0x1dd/0x3e0
[<0>] worker_thread+0x23f/0x3b0
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30

sampled three times with a few seconds in between, stack identical.

> Also, cp -a of a linux source tree is just as data intensive as it
> is metadata intensive. There's probably more data IO than metadata
> IO, so that's more likely to be what is slowing the disks down as
> metadata writeback is...
> 
>> I'm a bit concerned, though, that there seem to be a rather
>> unlimited (~ 1000) number of kernel worker threads waiting for the
>> cil push and indirectly for log writes.
> 
> That's normal - XFS is highly asynchronous and defers a lot of work
> to completion threads.
> 
>>> So it dropped by 16 bytes (seems to be common) which is
>>> unexpected.  I wonder if it filled a hole in a buffer and so
>>> needed one less xlog_op_header()? But then the size would have
>>> gone up by at least 128 bytes for the hole that was filled, so
>>> it still shouldn't go down in size.
>>>
>>> I think you need to instrument xlog_cil_insert_items() and catch
>>> a negative length here:
>>>
>>> 	/* account for space used by new iovec headers  */
>>> 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t); len +=
>>> 	iovhdr_res; ctx->nvecs += diff_iovecs;
>>>
>>> (diff_iovecs will be negative if the number of xlog_op_header
>>> structures goes down)
>>>
>>> And if this happens, then dump the transaction ticket via
>>> xlog_print_trans(tp) so we can see all the log items types and
>>> vectors that the transaction has formatted...
>>
>> I tried that, but the output was difficult to understand, because
>> at that point you can only log the complete transaction with the
>> items already updated.  And a shrinking item is not switched to the
>> shadow vector, so the formatted content is already overwritten and
>> not available for analysis.
> 
> Yes, that's exactly the information I need to see.
> 
> But the fact you think this is something I don't need to know about
> is classic Dunning-Kruger in action. You don't understand why I
> asked for this information, and found the information "difficult to
> understand", so you decided I didn't need it either, despite the
> fact I asked explicitly for it.
> 
> What I first need to know is what operations are being performed by
> the transaciton that shrunk and what all the items in it are, not
> which specific items shrunk and by how much. There can be tens to
> hundreds of items in a single transaction, and it's the combination
> of the transaction state, the reservation, the amount of the
> reservation that has been consumed, what items are consuming that
> reservation, etc. that I need to first see and analyse.
> 
> I don't ask for specific information just for fun - I ask for
> specific information because it is *necessary*. If you want the
> problem triaged and fixed, then please supply the information you
> are asked for, even if you don't understand why or what it means.

I see. I hope, you can make use of the following.

Attempt 1 ( base: Linux v5.1.10 + smartpqi driver 2.1.6-005)

+++ b/fs/xfs/xfs_log_cil.c
@@ -405,6 +405,11 @@ xlog_cil_insert_items(
  
         spin_lock(&cil->xc_cil_lock);
  
+       if (diff_iovecs < 0 && strcmp(tp->t_mountp->m_super->s_id, "md0") == 0) {
+               xfs_warn (tp->t_mountp, "XXX diff_iovecs %d", diff_iovecs);
+               xlog_print_trans(tp);
+       }
+
         /* account for space used by new iovec headers  */
         iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
         len += iovhdr_res;

Result: No log output before filesystem locked up after 15 iterations of the copy loop.

Attempt 2 ( base: previous patch):

--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -405,8 +405,8 @@ xlog_cil_insert_items(
  
         spin_lock(&cil->xc_cil_lock);
  
-       if (diff_iovecs < 0 && strcmp(tp->t_mountp->m_super->s_id, "md0") == 0) {
-               xfs_warn (tp->t_mountp, "XXX diff_iovecs %d", diff_iovecs);
+       if ((diff_iovecs < 0 || len < 0) && strcmp(tp->t_mountp->m_super->s_id, "md0") == 0) {
+               xfs_warn (tp->t_mountp, "XXX diff_iovecs %d diff_len %d", diff_iovecs, len);
                 xlog_print_trans(tp);
         }

Result: 594 rather similar events logged before filesystem locked up after 29 iterations of the copy loop

The third iovec sometimes has len=48 instead of len=16.

Last two events before the lockup:

2021-01-03T14:48:43.098887+01:00 deadbird kernel: [ 3194.831260] XFS (md0): XXX diff_iovecs 0 diff_len -16
2021-01-03T14:48:43.109363+01:00 deadbird kernel: [ 3194.837364] XFS (md0): transaction summary:
2021-01-03T14:48:43.109367+01:00 deadbird kernel: [ 3194.842544] XFS (md0):   log res   = 212728
2021-01-03T14:48:43.118996+01:00 deadbird kernel: [ 3194.847617] XFS (md0):   log count = 8
2021-01-03T14:48:43.119010+01:00 deadbird kernel: [ 3194.852193] XFS (md0):   flags     = 0x25
2021-01-03T14:48:43.129701+01:00 deadbird kernel: [ 3194.857156] XFS (md0): ticket reservation summary:
2021-01-03T14:48:43.129714+01:00 deadbird kernel: [ 3194.862890] XFS (md0):   unit res    = 225140 bytes
2021-01-03T14:48:43.135515+01:00 deadbird kernel: [ 3194.868710] XFS (md0):   current res = 225140 bytes
2021-01-03T14:48:43.148684+01:00 deadbird kernel: [ 3194.874702] XFS (md0):   total reg   = 0 bytes (o/flow = 0 bytes)
2021-01-03T14:48:43.148700+01:00 deadbird kernel: [ 3194.881885] XFS (md0):   ophdrs      = 0 (ophdr space = 0 bytes)
2021-01-03T14:48:43.155781+01:00 deadbird kernel: [ 3194.888975] XFS (md0):   ophdr + reg = 0 bytes
2021-01-03T14:48:43.161210+01:00 deadbird kernel: [ 3194.894404] XFS (md0):   num regions = 0
2021-01-03T14:48:43.165948+01:00 deadbird kernel: [ 3194.899141] XFS (md0): log item:
2021-01-03T14:48:43.169996+01:00 deadbird kernel: [ 3194.903203] XFS (md0):   type      = 0x123b
2021-01-03T14:48:43.174544+01:00 deadbird kernel: [ 3194.907741] XFS (md0):   flags     = 0x8
2021-01-03T14:48:43.178996+01:00 deadbird kernel: [ 3194.912191] XFS (md0):   niovecs   = 3
2021-01-03T14:48:43.183347+01:00 deadbird kernel: [ 3194.916546] XFS (md0):   size      = 696
2021-01-03T14:48:43.187598+01:00 deadbird kernel: [ 3194.920791] XFS (md0):   bytes     = 248
2021-01-03T14:48:43.191932+01:00 deadbird kernel: [ 3194.925131] XFS (md0):   buf len   = 248
2021-01-03T14:48:43.196581+01:00 deadbird kernel: [ 3194.929776] XFS (md0):   iovec[0]
2021-01-03T14:48:43.200633+01:00 deadbird kernel: [ 3194.933832] XFS (md0):     type    = 0x5
2021-01-03T14:48:43.205069+01:00 deadbird kernel: [ 3194.938264] XFS (md0):     len     = 56
2021-01-03T14:48:43.209293+01:00 deadbird kernel: [ 3194.942497] XFS (md0):     first 32 bytes of iovec[0]:
2021-01-03T14:48:43.215494+01:00 deadbird kernel: [ 3194.948690] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
2021-01-03T14:48:43.224801+01:00 deadbird kernel: [ 3194.957997] 00000010: 35 e3 ba 80 2e 00 00 00 00 00 00 00 00 00 00 00  5...............
2021-01-03T14:48:43.234100+01:00 deadbird kernel: [ 3194.967297] XFS (md0):   iovec[1]
2021-01-03T14:48:43.238293+01:00 deadbird kernel: [ 3194.971484] XFS (md0):     type    = 0x6
2021-01-03T14:48:43.242744+01:00 deadbird kernel: [ 3194.975942] XFS (md0):     len     = 176
2021-01-03T14:48:43.247108+01:00 deadbird kernel: [ 3194.980295] XFS (md0):     first 32 bytes of iovec[1]:
2021-01-03T14:48:43.253304+01:00 deadbird kernel: [ 3194.986506] 00000000: 4e 49 b0 81 03 02 00 00 7d 00 00 00 7d 00 00 00  NI......}...}...
2021-01-03T14:48:43.262648+01:00 deadbird kernel: [ 3194.995835] 00000010: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
2021-01-03T14:48:43.272064+01:00 deadbird kernel: [ 3195.005267] XFS (md0):   iovec[2]
2021-01-03T14:48:43.276174+01:00 deadbird kernel: [ 3195.009366] XFS (md0):     type    = 0x7
2021-01-03T14:48:43.280649+01:00 deadbird kernel: [ 3195.013848] XFS (md0):     len     = 16
2021-01-03T14:48:43.285042+01:00 deadbird kernel: [ 3195.018243] XFS (md0):     first 16 bytes of iovec[2]:
2021-01-03T14:48:43.291166+01:00 deadbird kernel: [ 3195.024364] 00000000: 00 00 00 00 00 00 00 00 00 ba 02 ef e0 01 23 ef  ..............#.

2021-01-03T14:48:43.300880+01:00 deadbird kernel: [ 3195.033706] XFS (md0): XXX diff_iovecs 0 diff_len -16
2021-01-03T14:48:43.311716+01:00 deadbird kernel: [ 3195.039849] XFS (md0): transaction summary:
2021-01-03T14:48:43.311732+01:00 deadbird kernel: [ 3195.044919] XFS (md0):   log res   = 212728
2021-01-03T14:48:43.316784+01:00 deadbird kernel: [ 3195.049982] XFS (md0):   log count = 8
2021-01-03T14:48:43.321455+01:00 deadbird kernel: [ 3195.054663] XFS (md0):   flags     = 0x25
2021-01-03T14:48:43.326316+01:00 deadbird kernel: [ 3195.059519] XFS (md0): ticket reservation summary:
2021-01-03T14:48:43.332050+01:00 deadbird kernel: [ 3195.065247] XFS (md0):   unit res    = 225140 bytes
2021-01-03T14:48:43.337873+01:00 deadbird kernel: [ 3195.071065] XFS (md0):   current res = 225140 bytes
2021-01-03T14:48:43.343762+01:00 deadbird kernel: [ 3195.076967] XFS (md0):   total reg   = 0 bytes (o/flow = 0 bytes)
2021-01-03T14:48:43.350951+01:00 deadbird kernel: [ 3195.084148] XFS (md0):   ophdrs      = 0 (ophdr space = 0 bytes)
2021-01-03T14:48:43.358036+01:00 deadbird kernel: [ 3195.091238] XFS (md0):   ophdr + reg = 0 bytes
2021-01-03T14:48:43.363476+01:00 deadbird kernel: [ 3195.096671] XFS (md0):   num regions = 0
2021-01-03T14:48:43.368213+01:00 deadbird kernel: [ 3195.101416] XFS (md0): log item:
2021-01-03T14:48:43.372263+01:00 deadbird kernel: [ 3195.105465] XFS (md0):   type      = 0x123b
2021-01-03T14:48:43.376895+01:00 deadbird kernel: [ 3195.110101] XFS (md0):   flags     = 0x8
2021-01-03T14:48:43.381240+01:00 deadbird kernel: [ 3195.114443] XFS (md0):   niovecs   = 3
2021-01-03T14:48:43.385587+01:00 deadbird kernel: [ 3195.118788] XFS (md0):   size      = 696
2021-01-03T14:48:43.389830+01:00 deadbird kernel: [ 3195.123026] XFS (md0):   bytes     = 248
2021-01-03T14:48:43.394355+01:00 deadbird kernel: [ 3195.127469] XFS (md0):   buf len   = 248
2021-01-03T14:48:43.398909+01:00 deadbird kernel: [ 3195.132100] XFS (md0):   iovec[0]
2021-01-03T14:48:43.402967+01:00 deadbird kernel: [ 3195.136147] XFS (md0):     type    = 0x5
2021-01-03T14:48:43.407367+01:00 deadbird kernel: [ 3195.140574] XFS (md0):     len     = 56
2021-01-03T14:48:43.411704+01:00 deadbird kernel: [ 3195.144903] XFS (md0):     first 32 bytes of iovec[0]:
2021-01-03T14:48:43.417812+01:00 deadbird kernel: [ 3195.151002] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
2021-01-03T14:48:43.427114+01:00 deadbird kernel: [ 3195.160302] 00000010: 57 b1 d6 80 2b 00 00 00 00 00 00 00 00 00 00 00  W...+...........
2021-01-03T14:48:43.440572+01:00 deadbird kernel: [ 3195.169704] XFS (md0):   iovec[1]
2021-01-03T14:48:43.440581+01:00 deadbird kernel: [ 3195.173774] XFS (md0):     type    = 0x6
2021-01-03T14:48:43.449478+01:00 deadbird kernel: [ 3195.178225] XFS (md0):     len     = 176
2021-01-03T14:48:43.449481+01:00 deadbird kernel: [ 3195.182687] XFS (md0):     first 32 bytes of iovec[1]:
2021-01-03T14:48:43.464908+01:00 deadbird kernel: [ 3195.188797] 00000000: 4e 49 20 81 03 02 00 00 7d 00 00 00 7d 00 00 00  NI .....}...}...
2021-01-03T14:48:43.464910+01:00 deadbird kernel: [ 3195.198116] 00000010: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
2021-01-03T14:48:43.478436+01:00 deadbird kernel: [ 3195.207555] XFS (md0):   iovec[2]
2021-01-03T14:48:43.478438+01:00 deadbird kernel: [ 3195.211645] XFS (md0):     type    = 0x7
2021-01-03T14:48:43.487300+01:00 deadbird kernel: [ 3195.216229] XFS (md0):     len     = 16
2021-01-03T14:48:43.487303+01:00 deadbird kernel: [ 3195.220509] XFS (md0):     first 16 bytes of iovec[2]:
2021-01-03T14:48:43.502751+01:00 deadbird kernel: [ 3195.226631] 00000000: 00 00 00 00 00 00 00 00 00 ae 04 29 00 00 da 24  ...........)...$

Full output available at https://owww.molgen.mpg.de/~buczek/2021-01-03_13-09/log.txt (2.2M)

>> So I've added some code to dump an item in its old and its new
>> state in xlog_cil_insert_format_items when either the requested
>> buffer len or the number of vectors decreased.
> 
> Yes, that would be useful when trying to understand the low level
> details of where a specific operation might be doing something
> unexpected, but I don't have a high level picture of what operations
> are triggering this issue so have no clue about the context in which
> these isolated, context free changes are occuring.
> 
> That said ....
> 
>> Several examples of different kind with a little bit of manual annotation following
>>
>> Best
>>    Donald
>>
>> - XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 32 to 20 bytes
>>
>> [   29.606212] XFS (sda1): XXX required buf size 184 -> 172
>> [   29.612591] XFS (sda1): XXX niovecs           3 -> 3
>>
>> [   29.618570] XFS (sda1): XXX old log item:
>> [   29.623469] XFS (sda1): log item:
>> [   29.627683] XFS (sda1):   type       = 0x123b                                               # XFS_LI_INODE
>> [   29.632375] XFS (sda1):   flags      = 0x8
>> [   29.636858] XFS (sda1):   niovecs    = 3
>> [   29.647442] XFS (sda1):   size       = 312
>> [   29.651814] XFS (sda1):   bytes      = 184
>> [   29.656278] XFS (sda1):   buf len    = 184
>> [   29.660927] XFS (sda1):   iovec[0]
>> [   29.665071] XFS (sda1):     type     = 0x5                                                  # XLOG_REG_TYPE_IFORMAT
>> [   29.669592] XFS (sda1):     len      = 56
>> [   29.673914] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   29.680079] 00000000: 3b 12 03 00 03 00 00 00 00 00 20 00 00 00 00 00  ;......... .....
>> [   29.689363] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   29.698633] XFS (sda1):   iovec[1]
>> [   29.702756] XFS (sda1):     type     = 0x6                                                  # XLOG_REG_TYPE_ICORE
>> [   29.707263] XFS (sda1):     len      = 96
>> [   29.711571] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   29.717720] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
>> [   29.726986] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
>> [   29.736241] XFS (sda1):   iovec[2]
>> [   29.740364] XFS (sda1):     type     = 0x9                                                  # XLOG_REG_TYPE_ILOCAL
>> [   29.744873] XFS (sda1):     len      = 32
>> [   29.749184] XFS (sda1):     first 32 bytes of iovec[2]:
>> [   29.755336] 00000000: 02 00 30 e7 02 26 06 00 40 73 65 72 76 65 72 00  ..0..&..@server.
>> [   29.764612] 00000010: 08 92 ef 04 00 58 6d 78 36 34 00 d3 93 58 00 58  .....Xmx64...X.X
>>
>> [   29.773900] XFS (sda1): XXX new log item:
>> [   29.778718] XFS (sda1): log item:
>> [   29.782856] XFS (sda1):   type       = 0x123b
>> [   29.787478] XFS (sda1):   flags      = 0x8
>> [   29.791902] XFS (sda1):   niovecs    = 3
>> [   29.796321] XFS (sda1):   size       = 312
>> [   29.800640] XFS (sda1):   bytes      = 172
>> [   29.805052] XFS (sda1):   buf len    = 176
>> [   29.809659] XFS (sda1):   iovec[0]
>> [   29.813781] XFS (sda1):     type     = 0x5
>> [   29.818289] XFS (sda1):     len      = 56
>> [   29.822599] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   29.828754] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
>> [   29.838024] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   29.847292] XFS (sda1):   iovec[1]
>> [   29.851420] XFS (sda1):     type     = 0x6
>> [   29.855933] XFS (sda1):     len      = 96
>> [   29.860247] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   29.866406] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
>> [   29.875677] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
>> [   29.884949] XFS (sda1):   iovec[2]
>> [   29.889081] XFS (sda1):     type     = 0x9
>> [   29.893601] XFS (sda1):     len      = 20
>> [   29.897924] XFS (sda1):     first 20 bytes of iovec[2]:
>> [   29.904096] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
>> [   29.913381] 00000010: 58 92 ef 04                                      X...
> 
> That may be removal of a directory entry from a local form
> directory. Hmmm - local form inode forks only log the active byte
> range so that could be the issue here.
> 
> Hold on - where did this come from? "cp -a" doesn't result in file
> and directory removals, right?

Correct. In the above part of the mail, where I described the load, that was reffering to your question "It takes the best part of *50 seconds* before the push work actually runs?". So this was related to the specific test from which the quoted loglines were.

The second part of the email was just about the question, whether it can be demonstrated that items in the CIL shrink. This was the followup to "Do you have tracing showing the operation where the relogged item has actually gotten smaller?" and "But, of course, if the size is not expected to decrease, this discrepancy must be clarified anyway. I am sure, I can find out the exact circumstances, this does happen. I will follow up on this. "

As soon as I added that logging, I got events from the system disk ("sda1") during boot activity. This already seemed to answer the question whether log items shrink (rightfully or wrongfully). So I didn't even start the described workload on the md devices. I should have made this difference clear and apologize that you needed to spend time to figure that out.

>> - (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 20 to 8 bytes
>>
>> [   29.982907] XFS (sda1): XXX required buf size 172 -> 160
>> [   29.992716] XFS (sda1): XXX niovecs           3 -> 3
>>
>> [   29.998728] XFS (sda1): XXX old log item:
>> [   30.003624] XFS (sda1): log item:
>> [   30.007835] XFS (sda1):   type       = 0x123b
>> [   30.012654] XFS (sda1):   flags      = 0x8
>> [   30.017145] XFS (sda1):   niovecs    = 3
>> [   30.021638] XFS (sda1):   size       = 312
>> [   30.026012] XFS (sda1):   bytes      = 172
>> [   30.030610] XFS (sda1):   buf len    = 176
>> [   30.035292] XFS (sda1):   iovec[0]
>> [   30.039480] XFS (sda1):     type     = 0x5
>> [   30.044054] XFS (sda1):     len      = 56
>> [   30.048534] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   30.054749] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
>> [   30.064091] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   30.073545] XFS (sda1):   iovec[1]
>> [   30.077744] XFS (sda1):     type     = 0x6
>> [   30.082455] XFS (sda1):     len      = 96
>> [   30.086824] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   30.093025] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
>> [   30.102346] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
>> [   30.111801] XFS (sda1):   iovec[2]
>> [   30.115989] XFS (sda1):     type     = 0x9
>> [   30.120715] XFS (sda1):     len      = 20
>> [   30.125102] XFS (sda1):     first 20 bytes of iovec[2]:
>> [   30.131331] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
>> [   30.140808] 00000010: 58 92 ef 04                                      X...
>>
>> [   30.149006] XFS (sda1): XXX new log item:
>> [   30.154039] XFS (sda1): log item:
>> [   30.154039] XFS (sda1):   type       = 0x123b
>> [   30.154041] XFS (sda1):   flags      = 0x8
>> [   30.167436] XFS (sda1):   niovecs    = 3
>> [   30.167437] XFS (sda1):   size       = 312
>> [   30.167438] XFS (sda1):   bytes      = 160
>> [   30.180881] XFS (sda1):   buf len    = 160
>> [   30.180882] XFS (sda1):   iovec[0]
>> [   30.180882] XFS (sda1):     type     = 0x5
>> [   30.180883] XFS (sda1):     len      = 56
>> [   30.180884] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   30.180884] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
>> [   30.180885] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   30.180886] XFS (sda1):   iovec[1]
>> [   30.180886] XFS (sda1):     type     = 0x6
>> [   30.180887] XFS (sda1):     len      = 96
>> [   30.180887] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   30.180888] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
>> [   30.180889] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
>> [   30.180889] XFS (sda1):   iovec[2]
>> [   30.180890] XFS (sda1):     type     = 0x9
>> [   30.180890] XFS (sda1):     len      = 8
>> [   30.180890] XFS (sda1):     first 8 bytes of iovec[2]:
>> [   30.180891] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..
> 
> And that looks to be another unlink in the same directory inode, now
> empty.
> 
>>
>> - (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL removed - niovecs from 3 to 2
>>
>> [   30.197403] XFS (sda1): XXX required buf size 160 -> 152
>> [   30.296091] XFS (sda1): XXX niovecs           3 -> 2
>>
>> [   30.296092] XFS (sda1): XXX old log item:
>> [   30.296093] XFS (sda1): log item:
>> [   30.297552] ixgbe 0000:01:00.1 net03: renamed from eth3
>> [   30.317524] XFS (sda1):   type       = 0x123b
>> [   30.317524] XFS (sda1):   flags      = 0x8
>> [   30.317525] XFS (sda1):   niovecs    = 3
>> [   30.317525] XFS (sda1):   size       = 304
>> [   30.317526] XFS (sda1):   bytes      = 160
>> [   30.317526] XFS (sda1):   buf len    = 160
>> [   30.317527] XFS (sda1):   iovec[0]
>> [   30.317527] XFS (sda1):     type     = 0x5
>> [   30.317528] XFS (sda1):     len      = 56
>> [   30.317528] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   30.317529] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
>> [   30.317530] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   30.317531] XFS (sda1):   iovec[1]
>> [   30.317531] XFS (sda1):     type     = 0x6
>> [   30.317531] XFS (sda1):     len      = 96
>> [   30.317532] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   30.317533] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
>> [   30.317533] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
>> [   30.317533] XFS (sda1):   iovec[2]
>> [   30.317534] XFS (sda1):     type     = 0x9
>> [   30.317534] XFS (sda1):     len      = 8
>> [   30.317535] XFS (sda1):     first 8 bytes of iovec[2]:
>> [   30.317535] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..
>>
>> [   30.317536] XFS (sda1): XXX new log item:
>> [   30.317537] XFS (sda1): log item:
>> [   30.317537] XFS (sda1):   type       = 0x123b
>> [   30.317538] XFS (sda1):   flags      = 0x8
>> [   30.317539] XFS (sda1):   niovecs    = 2
>> [   30.317539] XFS (sda1):   size       = 304
>> [   30.317540] XFS (sda1):   bytes      = 152
>> [   30.317540] XFS (sda1):   buf len    = 152
>> [   30.317541] XFS (sda1):   iovec[0]
>> [   30.317541] XFS (sda1):     type     = 0x5
>> [   30.317542] XFS (sda1):     len      = 56
>> [   30.317542] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   30.317543] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
>> [   30.317543] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   30.317544] XFS (sda1):   iovec[1]
>> [   30.317544] XFS (sda1):     type     = 0x6
>> [   30.317545] XFS (sda1):     len      = 96
>> [   30.317545] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   30.317546] 00000000: 4e 49 00 00 02 02 00 00 00 00 00 00 00 00 00 00  NI..............
>> [   30.317546] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> 
> And maybe that's an rmdir() triggering an unlink on the same - now
> empty - directory.
> 
> Does your workload also do 'rm -rf' anywhere because I can't explain
> how a concurrent 'cp -a' workload would trigger this behaviour...
> 
>> - XFS_LI_INODE.XLOG_REG_TYPE_IEXT removed - niovecs from 3 to 2
>>
>> [   37.983756] XFS (sda1): XXX required buf size 168 -> 152
>> [   37.990253] XFS (sda1): XXX niovecs           3 -> 2
>>
>> [   37.996202] XFS (sda1): XXX old log item:
>> [   38.001061] XFS (sda1): log item:
>> [   38.005239] XFS (sda1):   type       = 0x123b
>> [   38.009885] XFS (sda1):   flags      = 0x9
>> [   38.014330] XFS (sda1):   niovecs    = 3
>> [   38.018764] XFS (sda1):   size       = 440
>> [   38.023100] XFS (sda1):   bytes      = 168
>> [   38.027533] XFS (sda1):   buf len    = 168
>> [   38.032157] XFS (sda1):   iovec[0]
>> [   38.036286] XFS (sda1):     type     = 0x5
>> [   38.040796] XFS (sda1):     len      = 56
>> [   38.045114] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   38.051277] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
>> [   38.060562] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>> [   38.069853] XFS (sda1):   iovec[1]
>> [   38.073989] XFS (sda1):     type     = 0x6
>> [   38.078525] XFS (sda1):     len      = 96
>> [   38.082871] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   38.089052] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
>> [   38.098331] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.
>> [   38.107611] XFS (sda1):   iovec[2]
>> [   38.111754] XFS (sda1):     type     = 0x7                                                # XLOG_REG_TYPE_IEXT
>> [   38.116285] XFS (sda1):     len      = 16
>> [   38.120608] XFS (sda1):     first 16 bytes of iovec[2]:
>> [   38.126770] 00000000: 00 00 00 00 00 00 00 00 00 00 00 11 31 80 00 01  ............1...
>>
>> [   38.136054] XFS (sda1): XXX new log item:
>> [   38.140878] XFS (sda1): log item:
>> [   38.145025] XFS (sda1):   type       = 0x123b
>> [   38.149645] XFS (sda1):   flags      = 0x9
>> [   38.154067] XFS (sda1):   niovecs    = 2
>> [   38.158490] XFS (sda1):   size       = 440
>> [   38.162799] XFS (sda1):   bytes      = 152
>> [   38.167202] XFS (sda1):   buf len    = 152
>> [   38.171801] XFS (sda1):   iovec[0]
>> [   38.175911] XFS (sda1):     type     = 0x5
>> [   38.180409] XFS (sda1):     len      = 56
>> [   38.184708] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   38.190852] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
>> [   38.200115] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>> [   38.209373] XFS (sda1):   iovec[1]
>> [   38.213488] XFS (sda1):     type     = 0x6
>> [   38.217990] XFS (sda1):     len      = 96
>> [   38.222297] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   38.228442] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
>> [   38.237707] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.
> 
> truncate() or directory entry removal resulting in block to
> shortform conversion, maybe? Implies 'rm -rf', too.
> 
>>
>> - XFS_LI_BUF.XLOG_REG_TYPE_BCHUNK removed - niovecs from 3 to 2
>>
>> [ 2120.030590] XFS (md0): XXX required buf size 3992 -> 4120
>> [ 2120.037257] XFS (md0): XXX niovecs           3 -> 2
> 
> Logged buffer size went up because a hole was filled. Not a
> candidate. Was a directory buffer.
> 
>>
>> [ 2279.729095] XFS (sda1): XXX required buf size 152 -> 24
>> [ 2279.735437] XFS (sda1): XXX niovecs           2 -> 1
>>
>> [ 2279.741360] XFS (sda1): XXX old log item:
>> [ 2279.746199] XFS (sda1): log item:
>> [ 2279.750512] XFS (sda1):   type       = 0x123c
>> [ 2279.755181] XFS (sda1):   flags      = 0x8
>> [ 2279.759644] XFS (sda1):   niovecs    = 2
>> [ 2279.764246] XFS (sda1):   size       = 256
>> [ 2279.768625] XFS (sda1):   bytes      = 152
>> [ 2279.773094] XFS (sda1):   buf len    = 152
>> [ 2279.777741] XFS (sda1):   iovec[0]
>> [ 2279.782044] XFS (sda1):     type     = 0x1
>> [ 2279.786607] XFS (sda1):     len      = 24
>> [ 2279.790980] XFS (sda1):     first 24 bytes of iovec[0]:
>> [ 2279.797290] 00000000: 3c 12 02 00 00 20 08 00 b0 41 f8 08 00 00 00 00  <.... ...A......
>> [ 2279.806730] 00000010: 01 00 00 00 01 00 00 00                          ........
>> [ 2279.815287] XFS (sda1):   iovec[1]
>> [ 2279.819604] XFS (sda1):     type     = 0x2
>> [ 2279.824408] XFS (sda1):     len      = 128
>> [ 2279.828888] XFS (sda1):     first 32 bytes of iovec[1]:
>> [ 2279.835126] 00000000: 42 4d 41 50 00 00 00 09 ff ff ff ff ff ff ff ff  BMAP............
>> [ 2279.844617] 00000010: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00  ................
> 
> Hold on - BMAP?
> 
> But the directory buffer above was a XDB3, and so a matching BMAP
> btree buffer should have a BMA3 signature. So this is output from
> two different filesystems, one formatted with crc=0, the other with
> crc=1....
> 
> Oh, wait, now that I look at it, all the inodes in the output are v2
> inodes. Yet your testing is supposed to be running on CRC+reflink
> enabled filesystems, which use v3 inodes. So none of these traces
> come from the filesystem under your 'cp -a' workload, right?
> 
> Fmeh, it's right there - "XFS (sda1)...". These are all traces from
> your root directory, not md0/md1 which are your RAID6 devices
> running the 'cp -a' workload. The only trace from md0 was the hole
> filling buffer which grew the size of the log vector.
> 
> Ok, can you plese filter the output to limit it to just the RAID6
> filesystem under test so we get the transaction dumps just before it
> hangs? If you want to, create a separate dump output from the root
> device to capture the things it is doing, but the first thing is to
> isolate the 'cp -a' hang vector...
Okay, the (transaction-) logs for the `cp -a` load to the (empty) raid is above.

When this is understood, you probably also know, whether other transactions might still cause a problem and are of interest. So I suggest to postpone these for now.

Best

   Donald

> Cheers,
> 
> Dave.

Brian Foster Jan. 4, 2021, 4:23 p.m. UTC | #7

On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > Threads, which committed items to the CIL, wait in the xc_push_wait
> > waitqueue when used_space in the push context goes over a limit. These
> > threads need to be woken when the CIL is pushed.
> > 
> > The CIL push worker tries to avoid the overhead of calling wake_all()
> > when there are no waiters waiting. It does so by checking the same
> > condition which caused the waits to happen. This, however, is
> > unreliable, because ctx->space_used can actually decrease when items are
> > recommitted.
> 
> When does this happen?
> 
> Do you have tracing showing the operation where the relogged item
> has actually gotten smaller? By definition, relogging in the CIL
> should only grow the size of the object in the CIL because it must
> relog all the existing changes on top of the new changed being made
> to the object. Hence the CIL reservation should only ever grow.
> 
> IOWs, returning negative lengths from the formatting code is
> unexpected and probably a bug and requires further investigation,
> not papering over the occurrence with broadcast wakeups...
> 

I agree that this warrants a bit more explanation and analysis before
changing the current code...

> > If the value goes below the limit while some threads are
> > already waiting but before the push worker gets to it, these threads are
> > not woken.
> > 
> > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > optimization. This is possible, because we hold the xc_push_lock
> > spinlock, which prevents additions to the waitqueue.
> > 
> > Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> > ---
> >  fs/xfs/xfs_log_cil.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index b0ef071b3cb5..d620de8e217c 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> >  	/*
> >  	 * Wake up any background push waiters now this context is being pushed.
> >  	 */
> > -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > +	if (waitqueue_active(&cil->xc_push_wait))
> >  		wake_up_all(&cil->xc_push_wait);
> 
> That just smells wrong to me. It *might* be correct, but this
> condition should pair with the sleep condition, as space used by a
> CIL context should never actually decrease....
> 

... but I'm a little confused by this assertion. The shadow buffer
allocation code refers to the possibility of shadow buffers falling out
that are smaller than currently allocated buffers. Further, the
_insert_format_items() code appears to explicitly optimize for this
possibility by reusing the active buffer, subtracting the old size/count
values from the diff variables and then reformatting the latest
(presumably smaller) item to the lv.

Of course this could just be implementation detail. I haven't dug into
the details in the remainder of this thread and I don't have specific
examples off the top of my head, but perhaps based on the ability of
various structures to change formats and the ability of log vectors to
shrink in size, shouldn't we expect the possibility of a CIL context to
shrink in size as well? Just from poking around the CIL it seems like
the surrounding code supports it (xlog_cil_insert_items() checks len > 0
for recalculating split res as well)...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Dave Chinner Jan. 7, 2021, 9:54 p.m. UTC | #8

On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > If the value goes below the limit while some threads are
> > > already waiting but before the push worker gets to it, these threads are
> > > not woken.
> > > 
> > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > optimization. This is possible, because we hold the xc_push_lock
> > > spinlock, which prevents additions to the waitqueue.
> > > 
> > > Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> > > ---
> > >  fs/xfs/xfs_log_cil.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index b0ef071b3cb5..d620de8e217c 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > >  	/*
> > >  	 * Wake up any background push waiters now this context is being pushed.
> > >  	 */
> > > -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > +	if (waitqueue_active(&cil->xc_push_wait))
> > >  		wake_up_all(&cil->xc_push_wait);
> > 
> > That just smells wrong to me. It *might* be correct, but this
> > condition should pair with the sleep condition, as space used by a
> > CIL context should never actually decrease....
> > 
> 
> ... but I'm a little confused by this assertion. The shadow buffer
> allocation code refers to the possibility of shadow buffers falling out
> that are smaller than currently allocated buffers. Further, the
> _insert_format_items() code appears to explicitly optimize for this
> possibility by reusing the active buffer, subtracting the old size/count
> values from the diff variables and then reformatting the latest
> (presumably smaller) item to the lv.

Individual items might shrink, but the overall transaction should
grow. Think of a extent to btree conversion of an inode fork. THe
data in the inode fork decreases from a list of extents to a btree
root block pointer, so the inode item shrinks. But then we add a new
btree root block that contains all the extents + the btree block
header, and it gets rounded up to ithe 128 byte buffer logging chunk
size.

IOWs, while the inode item has decreased in size, the overall
space consumed by the transaction has gone up and so the CIL ctx
used_space should increase. Hence we can't just look at individual
log items and whether they have decreased in size - we have to look
at all the items in the transaction to understand how the space used
in that transaction has changed. i.e. it's the aggregation of all
items in the transaction that matter here, not so much the
individual items.

> Of course this could just be implementation detail. I haven't dug into
> the details in the remainder of this thread and I don't have specific
> examples off the top of my head, but perhaps based on the ability of
> various structures to change formats and the ability of log vectors to
> shrink in size, shouldn't we expect the possibility of a CIL context to
> shrink in size as well? Just from poking around the CIL it seems like
> the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> for recalculating split res as well)...

Yes, there may be situations where it decreases. It may be this is
fine, but the assumption *I've made* in lots of the CIL push code is
that ctx->used_space rarely, if ever, will go backwards.

e.g. we run the first transaction into the CIL, it steals the sapce
needed for the cil checkpoint headers for the transaciton. Then if
the space returned by the item formatting is negative (because it is
in the AIL and being relogged), the CIL checkpoint now doesn't have
the space reserved it needs to run a checkpoint. That transaction is
a sync transaction, so it forces the log, and now we push the CIL
without sufficient reservation to write out the log headers and the
items we just formatted....

So, yeah, shrinking transaction space usage definitely violates some
of the assumptions the code makes about how relogging works. It's
entirely possible the assumptions I've made are not entirely correct
in some corner cases - those particular cases are what we need to
ferret out here, and then decide if they are correct or not and deal
with it from there...

Cheers,

Dave.

Dave Chinner Jan. 7, 2021, 10:19 p.m. UTC | #9

On Sun, Jan 03, 2021 at 05:03:33PM +0100, Donald Buczek wrote:
> On 02.01.21 23:44, Dave Chinner wrote:
> > On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
> > > On 31.12.20 22:59, Dave Chinner wrote:
> > > > On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> > > > > On 30.12.20 23:16, Dave Chinner wrote:
> > 
> > > > One could argue that, but one should also understand the design
> > > > constraints for a particular algorithm are before suggesting
> > > > that their solution is "robust". :)
> > > 
> > > Yes, but an understanding to the extend required by the
> > > argument should be sufficient :-)
> > 
> > And that's the fundamental conceit described by Dunning-Kruger.
> > 
> > I.e. someone thinks they know enough to understand the argument,
> > when in fact they don't understand enough about the subject matter
> > to realise they don't understand what the expert at the other end is
> > saying at all....
> > 
> > > > > # seq 29
> > > > > 
> > > > > 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
> > > > 
> > > > So, at 20:08:15 we get a push trigger and the work is queued. But...
> > > > 
> > > > .....
> > > > > 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
> > > > 
> > > > It takes the best part of *50 seconds* before the push work actually
> > > > runs?
> > > > 
> > > > That's .... well and truly screwed up - the work should run on that
> > > > CPU on the very next time it yeilds the CPU. If we're holding the
> > > > CPU without yeilding it for that long, hangcheck and RCU warnings
> > > > should be going off...
> > > 
> > > No such warnings.
> > > 
> > > But the load is probably I/O bound to the log:
> > > 
> > > - creates `cp -a` copies of a directory tree with small files (linux source repository)
> > > - source tree probably completely cached.
> > > - two copies in parallel
> > > - slow log (on software raid6)
> > > 
> > > Isn't it to be expected that sooner or later you need to wait for
> > > log writes when you write as fast as possible with lots of
> > > metadata updates and not so much data?
> > 
> > No, incoming transactions waits for transaction reservation space,
> > not log writes. Reservation space is freed up by metadata writeback.
> > So if you have new transactions blocking in xfs_trans_reserve(),
> > then we are blocking on metadata writeback.
> > 
> > The CIL worker thread may be waiting for log IO completion before it
> > issues more log IO, and in that case it is waiting on iclog buffer
> > space (i.e. only waiting on internal log state, not metadata
> > writeback).  Can you find that kworker thread stack and paste it? If
> > bound on log IO, it will be blocked in xlog_get_iclog_space().
> > 
> > Please paste the entire stack output, too, not just the bits you
> > think are relevant or understand....
> 
> That would be a kworker on the "xfs-cil/%s" workqueue, correct?
> And you mean before the lockup, when the I/O is still active, correct?
> 
> This is the usual stack output from that thread:
> 
> # # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 :
> # cat /proc/2080/task/2080/stack
> 
> [<0>] md_flush_request+0x87/0x190
> [<0>] raid5_make_request+0x61b/0xb30
> [<0>] md_handle_request+0x127/0x1a0
> [<0>] md_submit_bio+0xbd/0x100
> [<0>] submit_bio_noacct+0x151/0x410
> [<0>] submit_bio+0x4b/0x1a0
> [<0>] xlog_state_release_iclog+0x87/0xb0
> [<0>] xlog_write+0x452/0x6d0
> [<0>] xlog_cil_push_work+0x2e0/0x4d0
> [<0>] process_one_work+0x1dd/0x3e0
> [<0>] worker_thread+0x23f/0x3b0
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
> 
> sampled three times with a few seconds in between, stack identical.

Yeah, so that is MD blocking waiting for a running device cache
flush to finish before submitting the iclog IO. IOWs, it's waiting
for hardware to flush volatile caches to stable storage. Every log
IO has a cache flush preceding it, and if the device doesn't support
FUA, then there is a post-IO cache flush as well. These flushes are
necessary for correct IO ordering between the log and metadata/data
writeback.

Essentially, you're waiting for MD to flush all it's dirty stripe
cache to the backing devices and then flush the backing device
caches as well. Nothing can be done at the filesystem level to make
this faster....

> > Also, cp -a of a linux source tree is just as data intensive as it
> > is metadata intensive. There's probably more data IO than metadata
> > IO, so that's more likely to be what is slowing the disks down as
> > metadata writeback is...
> > 
> > > I'm a bit concerned, though, that there seem to be a rather
> > > unlimited (~ 1000) number of kernel worker threads waiting for the
> > > cil push and indirectly for log writes.
> > 
> > That's normal - XFS is highly asynchronous and defers a lot of work
> > to completion threads.
> > 
> > > > So it dropped by 16 bytes (seems to be common) which is
> > > > unexpected.  I wonder if it filled a hole in a buffer and so
> > > > needed one less xlog_op_header()? But then the size would have
> > > > gone up by at least 128 bytes for the hole that was filled, so
> > > > it still shouldn't go down in size.
> > > > 
> > > > I think you need to instrument xlog_cil_insert_items() and catch
> > > > a negative length here:
> > > > 
> > > > 	/* account for space used by new iovec headers  */
> > > > 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t); len +=
> > > > 	iovhdr_res; ctx->nvecs += diff_iovecs;
> > > > 
> > > > (diff_iovecs will be negative if the number of xlog_op_header
> > > > structures goes down)
> > > > 
> > > > And if this happens, then dump the transaction ticket via
> > > > xlog_print_trans(tp) so we can see all the log items types and
> > > > vectors that the transaction has formatted...
> > > 
> > > I tried that, but the output was difficult to understand, because
> > > at that point you can only log the complete transaction with the
> > > items already updated.  And a shrinking item is not switched to the
> > > shadow vector, so the formatted content is already overwritten and
> > > not available for analysis.
> > 
> > Yes, that's exactly the information I need to see.
> > 
> > But the fact you think this is something I don't need to know about
> > is classic Dunning-Kruger in action. You don't understand why I
> > asked for this information, and found the information "difficult to
> > understand", so you decided I didn't need it either, despite the
> > fact I asked explicitly for it.
> > 
> > What I first need to know is what operations are being performed by
> > the transaciton that shrunk and what all the items in it are, not
> > which specific items shrunk and by how much. There can be tens to
> > hundreds of items in a single transaction, and it's the combination
> > of the transaction state, the reservation, the amount of the
> > reservation that has been consumed, what items are consuming that
> > reservation, etc. that I need to first see and analyse.
> > 
> > I don't ask for specific information just for fun - I ask for
> > specific information because it is *necessary*. If you want the
> > problem triaged and fixed, then please supply the information you
> > are asked for, even if you don't understand why or what it means.
> 
> I see. I hope, you can make use of the following.

....
> Last two events before the lockup:
> 
> 2021-01-03T14:48:43.098887+01:00 deadbird kernel: [ 3194.831260] XFS (md0): XXX diff_iovecs 0 diff_len -16
> 2021-01-03T14:48:43.109363+01:00 deadbird kernel: [ 3194.837364] XFS (md0): transaction summary:
> 2021-01-03T14:48:43.109367+01:00 deadbird kernel: [ 3194.842544] XFS (md0):   log res   = 212728
> 2021-01-03T14:48:43.118996+01:00 deadbird kernel: [ 3194.847617] XFS (md0):   log count = 8
> 2021-01-03T14:48:43.119010+01:00 deadbird kernel: [ 3194.852193] XFS (md0):   flags     = 0x25
> 2021-01-03T14:48:43.129701+01:00 deadbird kernel: [ 3194.857156] XFS (md0): ticket reservation summary:
> 2021-01-03T14:48:43.129714+01:00 deadbird kernel: [ 3194.862890] XFS (md0):   unit res    = 225140 bytes
> 2021-01-03T14:48:43.135515+01:00 deadbird kernel: [ 3194.868710] XFS (md0):   current res = 225140 bytes
> 2021-01-03T14:48:43.148684+01:00 deadbird kernel: [ 3194.874702] XFS (md0):   total reg   = 0 bytes (o/flow = 0 bytes)
> 2021-01-03T14:48:43.148700+01:00 deadbird kernel: [ 3194.881885] XFS (md0):   ophdrs      = 0 (ophdr space = 0 bytes)
> 2021-01-03T14:48:43.155781+01:00 deadbird kernel: [ 3194.888975] XFS (md0):   ophdr + reg = 0 bytes
> 2021-01-03T14:48:43.161210+01:00 deadbird kernel: [ 3194.894404] XFS (md0):   num regions = 0
> 2021-01-03T14:48:43.165948+01:00 deadbird kernel: [ 3194.899141] XFS (md0): log item:
> 2021-01-03T14:48:43.169996+01:00 deadbird kernel: [ 3194.903203] XFS (md0):   type      = 0x123b
> 2021-01-03T14:48:43.174544+01:00 deadbird kernel: [ 3194.907741] XFS (md0):   flags     = 0x8
> 2021-01-03T14:48:43.178996+01:00 deadbird kernel: [ 3194.912191] XFS (md0):   niovecs   = 3
> 2021-01-03T14:48:43.183347+01:00 deadbird kernel: [ 3194.916546] XFS (md0):   size      = 696
> 2021-01-03T14:48:43.187598+01:00 deadbird kernel: [ 3194.920791] XFS (md0):   bytes     = 248
> 2021-01-03T14:48:43.191932+01:00 deadbird kernel: [ 3194.925131] XFS (md0):   buf len   = 248
> 2021-01-03T14:48:43.196581+01:00 deadbird kernel: [ 3194.929776] XFS (md0):   iovec[0]
> 2021-01-03T14:48:43.200633+01:00 deadbird kernel: [ 3194.933832] XFS (md0):     type    = 0x5
> 2021-01-03T14:48:43.205069+01:00 deadbird kernel: [ 3194.938264] XFS (md0):     len     = 56
> 2021-01-03T14:48:43.209293+01:00 deadbird kernel: [ 3194.942497] XFS (md0):     first 32 bytes of iovec[0]:
> 2021-01-03T14:48:43.215494+01:00 deadbird kernel: [ 3194.948690] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
> 2021-01-03T14:48:43.224801+01:00 deadbird kernel: [ 3194.957997] 00000010: 35 e3 ba 80 2e 00 00 00 00 00 00 00 00 00 00 00  5...............
> 2021-01-03T14:48:43.234100+01:00 deadbird kernel: [ 3194.967297] XFS (md0):   iovec[1]
> 2021-01-03T14:48:43.238293+01:00 deadbird kernel: [ 3194.971484] XFS (md0):     type    = 0x6
> 2021-01-03T14:48:43.242744+01:00 deadbird kernel: [ 3194.975942] XFS (md0):     len     = 176
> 2021-01-03T14:48:43.247108+01:00 deadbird kernel: [ 3194.980295] XFS (md0):     first 32 bytes of iovec[1]:
> 2021-01-03T14:48:43.253304+01:00 deadbird kernel: [ 3194.986506] 00000000: 4e 49 b0 81 03 02 00 00 7d 00 00 00 7d 00 00 00  NI......}...}...
> 2021-01-03T14:48:43.262648+01:00 deadbird kernel: [ 3194.995835] 00000010: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 2021-01-03T14:48:43.272064+01:00 deadbird kernel: [ 3195.005267] XFS (md0):   iovec[2]
> 2021-01-03T14:48:43.276174+01:00 deadbird kernel: [ 3195.009366] XFS (md0):     type    = 0x7
> 2021-01-03T14:48:43.280649+01:00 deadbird kernel: [ 3195.013848] XFS (md0):     len     = 16
> 2021-01-03T14:48:43.285042+01:00 deadbird kernel: [ 3195.018243] XFS (md0):     first 16 bytes of iovec[2]:
> 2021-01-03T14:48:43.291166+01:00 deadbird kernel: [ 3195.024364] 00000000: 00 00 00 00 00 00 00 00 00 ba 02 ef e0 01 23 ef  ..............#.

Ok, they are all of this type - same transaction type/size, same
form, all a reduction in the size of the inode data fork in extent
format. I suspect this is trimming speculative block preallocation
beyond EOF on final close of a written file.

Ok, that matches the case from the root directory you gave earlier,
where the inode data fork was reducing in size from unlinks in a
shortform directory.  Different vector, different inode fork format,
but it appears to result in the same thing where just an inode with
a reducing inode fork size is relogged.

I'll think about this over the weekend and look at it more closely
next week when I'm back from PTO...

Thanks for following up and providing this information, Donald!

Cheers,

Dave.

Brian Foster Jan. 8, 2021, 4:56 p.m. UTC | #10

On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > If the value goes below the limit while some threads are
> > > > already waiting but before the push worker gets to it, these threads are
> > > > not woken.
> > > > 
> > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > optimization. This is possible, because we hold the xc_push_lock
> > > > spinlock, which prevents additions to the waitqueue.
> > > > 
> > > > Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> > > > ---
> > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > --- a/fs/xfs/xfs_log_cil.c
> > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > >  	/*
> > > >  	 * Wake up any background push waiters now this context is being pushed.
> > > >  	 */
> > > > -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > +	if (waitqueue_active(&cil->xc_push_wait))
> > > >  		wake_up_all(&cil->xc_push_wait);
> > > 
> > > That just smells wrong to me. It *might* be correct, but this
> > > condition should pair with the sleep condition, as space used by a
> > > CIL context should never actually decrease....
> > > 
> > 
> > ... but I'm a little confused by this assertion. The shadow buffer
> > allocation code refers to the possibility of shadow buffers falling out
> > that are smaller than currently allocated buffers. Further, the
> > _insert_format_items() code appears to explicitly optimize for this
> > possibility by reusing the active buffer, subtracting the old size/count
> > values from the diff variables and then reformatting the latest
> > (presumably smaller) item to the lv.
> 
> Individual items might shrink, but the overall transaction should
> grow. Think of a extent to btree conversion of an inode fork. THe
> data in the inode fork decreases from a list of extents to a btree
> root block pointer, so the inode item shrinks. But then we add a new
> btree root block that contains all the extents + the btree block
> header, and it gets rounded up to ithe 128 byte buffer logging chunk
> size.
> 
> IOWs, while the inode item has decreased in size, the overall
> space consumed by the transaction has gone up and so the CIL ctx
> used_space should increase. Hence we can't just look at individual
> log items and whether they have decreased in size - we have to look
> at all the items in the transaction to understand how the space used
> in that transaction has changed. i.e. it's the aggregation of all
> items in the transaction that matter here, not so much the
> individual items.
> 

Ok, that makes more sense...

> > Of course this could just be implementation detail. I haven't dug into
> > the details in the remainder of this thread and I don't have specific
> > examples off the top of my head, but perhaps based on the ability of
> > various structures to change formats and the ability of log vectors to
> > shrink in size, shouldn't we expect the possibility of a CIL context to
> > shrink in size as well? Just from poking around the CIL it seems like
> > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > for recalculating split res as well)...
> 
> Yes, there may be situations where it decreases. It may be this is
> fine, but the assumption *I've made* in lots of the CIL push code is
> that ctx->used_space rarely, if ever, will go backwards.
> 

... and rarely seems a bit more pragmatic than never.

> e.g. we run the first transaction into the CIL, it steals the sapce
> needed for the cil checkpoint headers for the transaciton. Then if
> the space returned by the item formatting is negative (because it is
> in the AIL and being relogged), the CIL checkpoint now doesn't have
> the space reserved it needs to run a checkpoint. That transaction is
> a sync transaction, so it forces the log, and now we push the CIL
> without sufficient reservation to write out the log headers and the
> items we just formatted....
> 

Hmmm... that seems like an odd scenario because I'd expect the space
usage delta to reflect what might or might not have already been added
to the CIL context, not necessarily the AIL. IOW, shouldn't a negative
delta only occur for items being relogged while still CIL resident
(regardless of AIL residency)?

From a code standpoint, the way a particular log item delta comes out
negative is from having a shadow lv size smaller than the ->li_lv size.
Thus, xlog_cil_insert_format_items() subtracts the currently formatted
lv size from the delta, formats the current state of the item, and
xfs_cil_prepare_item() adds the new (presumably smaller) size to the
delta. We reuse ->li_lv in this scenario so both it and the shadow
buffer remain, but a CIL push pulls ->li_lv from all log items and
chains them to the CIL context for writing, so I don't see how we could
have an item return a negative delta on an empty CIL. Hm?

(I was also wondering whether repeated smaller relogs of an item could
be a vector for this to go wrong, but it looks like
xlog_cil_insert_format_items() always uses the formatted size of the
current buffer...).

Brian

> So, yeah, shrinking transaction space usage definitely violates some
> of the assumptions the code makes about how relogging works. It's
> entirely possible the assumptions I've made are not entirely correct
> in some corner cases - those particular cases are what we need to
> ferret out here, and then decide if they are correct or not and deal
> with it from there...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Donald Buczek Jan. 9, 2021, 2:39 p.m. UTC | #11

On 07.01.21 23:19, Dave Chinner wrote:
> On Sun, Jan 03, 2021 at 05:03:33PM +0100, Donald Buczek wrote:
>> On 02.01.21 23:44, Dave Chinner wrote:
>>> On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
>>>> On 31.12.20 22:59, Dave Chinner wrote:
>>>>> On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
>>>>>> On 30.12.20 23:16, Dave Chinner wrote:
>>>
>>>>> One could argue that, but one should also understand the design
>>>>> constraints for a particular algorithm are before suggesting
>>>>> that their solution is "robust". :)
>>>>
>>>> Yes, but an understanding to the extend required by the
>>>> argument should be sufficient :-)
>>>
>>> And that's the fundamental conceit described by Dunning-Kruger.
>>>
>>> I.e. someone thinks they know enough to understand the argument,
>>> when in fact they don't understand enough about the subject matter
>>> to realise they don't understand what the expert at the other end is
>>> saying at all....
>>>
>>>>>> # seq 29
>>>>>>
>>>>>> 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
>>>>>
>>>>> So, at 20:08:15 we get a push trigger and the work is queued. But...
>>>>>
>>>>> .....
>>>>>> 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
>>>>>
>>>>> It takes the best part of *50 seconds* before the push work actually
>>>>> runs?
>>>>>
>>>>> That's .... well and truly screwed up - the work should run on that
>>>>> CPU on the very next time it yeilds the CPU. If we're holding the
>>>>> CPU without yeilding it for that long, hangcheck and RCU warnings
>>>>> should be going off...
>>>>
>>>> No such warnings.
>>>>
>>>> But the load is probably I/O bound to the log:
>>>>
>>>> - creates `cp -a` copies of a directory tree with small files (linux source repository)
>>>> - source tree probably completely cached.
>>>> - two copies in parallel
>>>> - slow log (on software raid6)
>>>>
>>>> Isn't it to be expected that sooner or later you need to wait for
>>>> log writes when you write as fast as possible with lots of
>>>> metadata updates and not so much data?
>>>
>>> No, incoming transactions waits for transaction reservation space,
>>> not log writes. Reservation space is freed up by metadata writeback.
>>> So if you have new transactions blocking in xfs_trans_reserve(),
>>> then we are blocking on metadata writeback.
>>>
>>> The CIL worker thread may be waiting for log IO completion before it
>>> issues more log IO, and in that case it is waiting on iclog buffer
>>> space (i.e. only waiting on internal log state, not metadata
>>> writeback).  Can you find that kworker thread stack and paste it? If
>>> bound on log IO, it will be blocked in xlog_get_iclog_space().
>>>
>>> Please paste the entire stack output, too, not just the bits you
>>> think are relevant or understand....
>>
>> That would be a kworker on the "xfs-cil/%s" workqueue, correct?
>> And you mean before the lockup, when the I/O is still active, correct?
>>
>> This is the usual stack output from that thread:
>>
>> # # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 :
>> # cat /proc/2080/task/2080/stack
>>
>> [<0>] md_flush_request+0x87/0x190
>> [<0>] raid5_make_request+0x61b/0xb30
>> [<0>] md_handle_request+0x127/0x1a0
>> [<0>] md_submit_bio+0xbd/0x100
>> [<0>] submit_bio_noacct+0x151/0x410
>> [<0>] submit_bio+0x4b/0x1a0
>> [<0>] xlog_state_release_iclog+0x87/0xb0
>> [<0>] xlog_write+0x452/0x6d0
>> [<0>] xlog_cil_push_work+0x2e0/0x4d0
>> [<0>] process_one_work+0x1dd/0x3e0
>> [<0>] worker_thread+0x23f/0x3b0
>> [<0>] kthread+0x118/0x130
>> [<0>] ret_from_fork+0x22/0x30
>>
>> sampled three times with a few seconds in between, stack identical.
> 
> Yeah, so that is MD blocking waiting for a running device cache
> flush to finish before submitting the iclog IO. IOWs, it's waiting
> for hardware to flush volatile caches to stable storage. Every log
> IO has a cache flush preceding it, and if the device doesn't support
> FUA, then there is a post-IO cache flush as well. These flushes are
> necessary for correct IO ordering between the log and metadata/data
> writeback.
> 
> Essentially, you're waiting for MD to flush all it's dirty stripe
> cache to the backing devices and then flush the backing device
> caches as well. Nothing can be done at the filesystem level to make
> this faster....
> 
>>> Also, cp -a of a linux source tree is just as data intensive as it
>>> is metadata intensive. There's probably more data IO than metadata
>>> IO, so that's more likely to be what is slowing the disks down as
>>> metadata writeback is...
>>>
>>>> I'm a bit concerned, though, that there seem to be a rather
>>>> unlimited (~ 1000) number of kernel worker threads waiting for the
>>>> cil push and indirectly for log writes.
>>>
>>> That's normal - XFS is highly asynchronous and defers a lot of work
>>> to completion threads.
>>>
>>>>> So it dropped by 16 bytes (seems to be common) which is
>>>>> unexpected.  I wonder if it filled a hole in a buffer and so
>>>>> needed one less xlog_op_header()? But then the size would have
>>>>> gone up by at least 128 bytes for the hole that was filled, so
>>>>> it still shouldn't go down in size.
>>>>>
>>>>> I think you need to instrument xlog_cil_insert_items() and catch
>>>>> a negative length here:
>>>>>
>>>>> 	/* account for space used by new iovec headers  */
>>>>> 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t); len +=
>>>>> 	iovhdr_res; ctx->nvecs += diff_iovecs;
>>>>>
>>>>> (diff_iovecs will be negative if the number of xlog_op_header
>>>>> structures goes down)
>>>>>
>>>>> And if this happens, then dump the transaction ticket via
>>>>> xlog_print_trans(tp) so we can see all the log items types and
>>>>> vectors that the transaction has formatted...
>>>>
>>>> I tried that, but the output was difficult to understand, because
>>>> at that point you can only log the complete transaction with the
>>>> items already updated.  And a shrinking item is not switched to the
>>>> shadow vector, so the formatted content is already overwritten and
>>>> not available for analysis.
>>>
>>> Yes, that's exactly the information I need to see.
>>>
>>> But the fact you think this is something I don't need to know about
>>> is classic Dunning-Kruger in action. You don't understand why I
>>> asked for this information, and found the information "difficult to
>>> understand", so you decided I didn't need it either, despite the
>>> fact I asked explicitly for it.
>>>
>>> What I first need to know is what operations are being performed by
>>> the transaciton that shrunk and what all the items in it are, not
>>> which specific items shrunk and by how much. There can be tens to
>>> hundreds of items in a single transaction, and it's the combination
>>> of the transaction state, the reservation, the amount of the
>>> reservation that has been consumed, what items are consuming that
>>> reservation, etc. that I need to first see and analyse.
>>>
>>> I don't ask for specific information just for fun - I ask for
>>> specific information because it is *necessary*. If you want the
>>> problem triaged and fixed, then please supply the information you
>>> are asked for, even if you don't understand why or what it means.
>>
>> I see. I hope, you can make use of the following.
> 
> ....
>> Last two events before the lockup:
>>
>> 2021-01-03T14:48:43.098887+01:00 deadbird kernel: [ 3194.831260] XFS (md0): XXX diff_iovecs 0 diff_len -16
>> 2021-01-03T14:48:43.109363+01:00 deadbird kernel: [ 3194.837364] XFS (md0): transaction summary:
>> 2021-01-03T14:48:43.109367+01:00 deadbird kernel: [ 3194.842544] XFS (md0):   log res   = 212728
>> 2021-01-03T14:48:43.118996+01:00 deadbird kernel: [ 3194.847617] XFS (md0):   log count = 8
>> 2021-01-03T14:48:43.119010+01:00 deadbird kernel: [ 3194.852193] XFS (md0):   flags     = 0x25
>> 2021-01-03T14:48:43.129701+01:00 deadbird kernel: [ 3194.857156] XFS (md0): ticket reservation summary:
>> 2021-01-03T14:48:43.129714+01:00 deadbird kernel: [ 3194.862890] XFS (md0):   unit res    = 225140 bytes
>> 2021-01-03T14:48:43.135515+01:00 deadbird kernel: [ 3194.868710] XFS (md0):   current res = 225140 bytes
>> 2021-01-03T14:48:43.148684+01:00 deadbird kernel: [ 3194.874702] XFS (md0):   total reg   = 0 bytes (o/flow = 0 bytes)
>> 2021-01-03T14:48:43.148700+01:00 deadbird kernel: [ 3194.881885] XFS (md0):   ophdrs      = 0 (ophdr space = 0 bytes)
>> 2021-01-03T14:48:43.155781+01:00 deadbird kernel: [ 3194.888975] XFS (md0):   ophdr + reg = 0 bytes
>> 2021-01-03T14:48:43.161210+01:00 deadbird kernel: [ 3194.894404] XFS (md0):   num regions = 0
>> 2021-01-03T14:48:43.165948+01:00 deadbird kernel: [ 3194.899141] XFS (md0): log item:
>> 2021-01-03T14:48:43.169996+01:00 deadbird kernel: [ 3194.903203] XFS (md0):   type      = 0x123b
>> 2021-01-03T14:48:43.174544+01:00 deadbird kernel: [ 3194.907741] XFS (md0):   flags     = 0x8
>> 2021-01-03T14:48:43.178996+01:00 deadbird kernel: [ 3194.912191] XFS (md0):   niovecs   = 3
>> 2021-01-03T14:48:43.183347+01:00 deadbird kernel: [ 3194.916546] XFS (md0):   size      = 696
>> 2021-01-03T14:48:43.187598+01:00 deadbird kernel: [ 3194.920791] XFS (md0):   bytes     = 248
>> 2021-01-03T14:48:43.191932+01:00 deadbird kernel: [ 3194.925131] XFS (md0):   buf len   = 248
>> 2021-01-03T14:48:43.196581+01:00 deadbird kernel: [ 3194.929776] XFS (md0):   iovec[0]
>> 2021-01-03T14:48:43.200633+01:00 deadbird kernel: [ 3194.933832] XFS (md0):     type    = 0x5
>> 2021-01-03T14:48:43.205069+01:00 deadbird kernel: [ 3194.938264] XFS (md0):     len     = 56
>> 2021-01-03T14:48:43.209293+01:00 deadbird kernel: [ 3194.942497] XFS (md0):     first 32 bytes of iovec[0]:
>> 2021-01-03T14:48:43.215494+01:00 deadbird kernel: [ 3194.948690] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
>> 2021-01-03T14:48:43.224801+01:00 deadbird kernel: [ 3194.957997] 00000010: 35 e3 ba 80 2e 00 00 00 00 00 00 00 00 00 00 00  5...............
>> 2021-01-03T14:48:43.234100+01:00 deadbird kernel: [ 3194.967297] XFS (md0):   iovec[1]
>> 2021-01-03T14:48:43.238293+01:00 deadbird kernel: [ 3194.971484] XFS (md0):     type    = 0x6
>> 2021-01-03T14:48:43.242744+01:00 deadbird kernel: [ 3194.975942] XFS (md0):     len     = 176
>> 2021-01-03T14:48:43.247108+01:00 deadbird kernel: [ 3194.980295] XFS (md0):     first 32 bytes of iovec[1]:
>> 2021-01-03T14:48:43.253304+01:00 deadbird kernel: [ 3194.986506] 00000000: 4e 49 b0 81 03 02 00 00 7d 00 00 00 7d 00 00 00  NI......}...}...
>> 2021-01-03T14:48:43.262648+01:00 deadbird kernel: [ 3194.995835] 00000010: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>> 2021-01-03T14:48:43.272064+01:00 deadbird kernel: [ 3195.005267] XFS (md0):   iovec[2]
>> 2021-01-03T14:48:43.276174+01:00 deadbird kernel: [ 3195.009366] XFS (md0):     type    = 0x7
>> 2021-01-03T14:48:43.280649+01:00 deadbird kernel: [ 3195.013848] XFS (md0):     len     = 16
>> 2021-01-03T14:48:43.285042+01:00 deadbird kernel: [ 3195.018243] XFS (md0):     first 16 bytes of iovec[2]:
>> 2021-01-03T14:48:43.291166+01:00 deadbird kernel: [ 3195.024364] 00000000: 00 00 00 00 00 00 00 00 00 ba 02 ef e0 01 23 ef  ..............#.
> 
> Ok, they are all of this type - same transaction type/size, same
> form, all a reduction in the size of the inode data fork in extent
> format. I suspect this is trimming speculative block preallocation
> beyond EOF on final close of a written file.

Btw: I looked up such a file by its inode number and it was a rather
long one (git pack).

But please keep in mind, that this log was restricted to the "only
write to fresh filesystem" scenario. If you do something else on the
filesystem, there appear to be other cases than just the trimming
of the last extent.

I've collected some more log at
http://owww.molgen.mpg.de/~buczek/2021-01-03_13-09/log2.txt

The swap to the shadow buffer was made unconditional so that the
old vectors could be logged together with the new vectors. And I
logged a few bytes more from the buffer. A post-processing
script provided the summary header and removed duplicates.

The first events are from the "cp" with the truncation of the data
fork as seen before. But starting with the transaction at 881.111968
there are six of another kind with a very high diff_len (and
apparently items without an allocated vector).

After 945.723769 I playfully tried to force truncation events at
the attribute fork, so this is not new, just the other fork.

Donald

> Ok, that matches the case from the root directory you gave earlier,
> where the inode data fork was reducing in size from unlinks in a
> shortform directory.  Different vector, different inode fork format,
> but it appears to result in the same thing where just an inode with
> a reducing inode fork size is relogged.
> 
> 
> I'll think about this over the weekend and look at it more closely
> next week when I'm back from PTO...
> 
> Thanks for following up and providing this information, Donald!
> 
> Cheers,
> 
> Dave.
>

Brian Foster Jan. 11, 2021, 4:38 p.m. UTC | #12

On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > > If the value goes below the limit while some threads are
> > > > > already waiting but before the push worker gets to it, these threads are
> > > > > not woken.
> > > > > 
> > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > > optimization. This is possible, because we hold the xc_push_lock
> > > > > spinlock, which prevents additions to the waitqueue.
> > > > > 
> > > > > Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> > > > > ---
> > > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > > --- a/fs/xfs/xfs_log_cil.c
> > > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > > >  	/*
> > > > >  	 * Wake up any background push waiters now this context is being pushed.
> > > > >  	 */
> > > > > -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > > +	if (waitqueue_active(&cil->xc_push_wait))
> > > > >  		wake_up_all(&cil->xc_push_wait);
> > > > 
> > > > That just smells wrong to me. It *might* be correct, but this
> > > > condition should pair with the sleep condition, as space used by a
> > > > CIL context should never actually decrease....
> > > > 
> > > 
> > > ... but I'm a little confused by this assertion. The shadow buffer
> > > allocation code refers to the possibility of shadow buffers falling out
> > > that are smaller than currently allocated buffers. Further, the
> > > _insert_format_items() code appears to explicitly optimize for this
> > > possibility by reusing the active buffer, subtracting the old size/count
> > > values from the diff variables and then reformatting the latest
> > > (presumably smaller) item to the lv.
> > 
> > Individual items might shrink, but the overall transaction should
> > grow. Think of a extent to btree conversion of an inode fork. THe
> > data in the inode fork decreases from a list of extents to a btree
> > root block pointer, so the inode item shrinks. But then we add a new
> > btree root block that contains all the extents + the btree block
> > header, and it gets rounded up to ithe 128 byte buffer logging chunk
> > size.
> > 
> > IOWs, while the inode item has decreased in size, the overall
> > space consumed by the transaction has gone up and so the CIL ctx
> > used_space should increase. Hence we can't just look at individual
> > log items and whether they have decreased in size - we have to look
> > at all the items in the transaction to understand how the space used
> > in that transaction has changed. i.e. it's the aggregation of all
> > items in the transaction that matter here, not so much the
> > individual items.
> > 
> 
> Ok, that makes more sense...
> 
> > > Of course this could just be implementation detail. I haven't dug into
> > > the details in the remainder of this thread and I don't have specific
> > > examples off the top of my head, but perhaps based on the ability of
> > > various structures to change formats and the ability of log vectors to
> > > shrink in size, shouldn't we expect the possibility of a CIL context to
> > > shrink in size as well? Just from poking around the CIL it seems like
> > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > > for recalculating split res as well)...
> > 
> > Yes, there may be situations where it decreases. It may be this is
> > fine, but the assumption *I've made* in lots of the CIL push code is
> > that ctx->used_space rarely, if ever, will go backwards.
> > 
> 
> ... and rarely seems a bit more pragmatic than never.
> 

FWIW, a cursory look at the inode size/format code (motivated by
Donald's recent log dump that appears to show inode log items changing
size) suggested that a simple local format size change might be enough
to cause this condition on an item. A subsequent test to create and
immediately remove a file from an otherwise empty directory triggers a
tracepoint I injected in xlog_cil_insert_items() to detect a negative
transaction delta. As expected, the absolute value of the delta does
seem to increase with a larger filename. This also produces a negative
iovec delta, fwiw. E.g.:

# touch `for i in $(seq 0 63); do echo -n a; done`
# rm -f `for i in $(seq 0 63); do echo -n a; done`
#

rm-9265    [001] ....  4660.177806: xfs_log_commit_cil: 409: len -72 diff_iovecs 0
rm-9265    [001] .N.1  4660.177913: xfs_log_commit_cil: 419: len -72 diff_iovecs 0
rm-9265    [001] ....  4660.178313: xfs_log_commit_cil: 409: len -52 diff_iovecs -1
rm-9265    [001] ...1  4660.178336: xfs_log_commit_cil: 419: len -64 diff_iovecs -1

... and this only seems to occur when the unlink occurs before the CIL
has been checkpointed and pushed out the inode (i.e. a freeze/unfreeze
cycle prevents it).

I've not dug into the transaction details and have no idea if this is
the variant that Donald reproduces; it wouldn't surprise me a ton if
there were various others. This is pretty straightforward, however, and
shows the negative item delta carry through the transaction. IMO, that
seems to justify a throttling fix...

Brian

> > e.g. we run the first transaction into the CIL, it steals the sapce
> > needed for the cil checkpoint headers for the transaciton. Then if
> > the space returned by the item formatting is negative (because it is
> > in the AIL and being relogged), the CIL checkpoint now doesn't have
> > the space reserved it needs to run a checkpoint. That transaction is
> > a sync transaction, so it forces the log, and now we push the CIL
> > without sufficient reservation to write out the log headers and the
> > items we just formatted....
> > 
> 
> Hmmm... that seems like an odd scenario because I'd expect the space
> usage delta to reflect what might or might not have already been added
> to the CIL context, not necessarily the AIL. IOW, shouldn't a negative
> delta only occur for items being relogged while still CIL resident
> (regardless of AIL residency)?
> 
> From a code standpoint, the way a particular log item delta comes out
> negative is from having a shadow lv size smaller than the ->li_lv size.
> Thus, xlog_cil_insert_format_items() subtracts the currently formatted
> lv size from the delta, formats the current state of the item, and
> xfs_cil_prepare_item() adds the new (presumably smaller) size to the
> delta. We reuse ->li_lv in this scenario so both it and the shadow
> buffer remain, but a CIL push pulls ->li_lv from all log items and
> chains them to the CIL context for writing, so I don't see how we could
> have an item return a negative delta on an empty CIL. Hm?
> 
> (I was also wondering whether repeated smaller relogs of an item could
> be a vector for this to go wrong, but it looks like
> xlog_cil_insert_format_items() always uses the formatted size of the
> current buffer...).
> 
> Brian
> 
> > So, yeah, shrinking transaction space usage definitely violates some
> > of the assumptions the code makes about how relogging works. It's
> > entirely possible the assumptions I've made are not entirely correct
> > in some corner cases - those particular cases are what we need to
> > ferret out here, and then decide if they are correct or not and deal
> > with it from there...
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
>

Dave Chinner Jan. 13, 2021, 9:44 p.m. UTC | #13

On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > e.g. we run the first transaction into the CIL, it steals the sapce
> > needed for the cil checkpoint headers for the transaciton. Then if
> > the space returned by the item formatting is negative (because it is
> > in the AIL and being relogged), the CIL checkpoint now doesn't have
> > the space reserved it needs to run a checkpoint. That transaction is
> > a sync transaction, so it forces the log, and now we push the CIL
> > without sufficient reservation to write out the log headers and the
> > items we just formatted....
> > 
> 
> Hmmm... that seems like an odd scenario because I'd expect the space
> usage delta to reflect what might or might not have already been added
> to the CIL context, not necessarily the AIL. IOW, shouldn't a negative
> delta only occur for items being relogged while still CIL resident
> (regardless of AIL residency)?
>
> From a code standpoint, the way a particular log item delta comes out
> negative is from having a shadow lv size smaller than the ->li_lv size.
> Thus, xlog_cil_insert_format_items() subtracts the currently formatted
> lv size from the delta, formats the current state of the item, and
> xfs_cil_prepare_item() adds the new (presumably smaller) size to the
> delta. We reuse ->li_lv in this scenario so both it and the shadow
> buffer remain, but a CIL push pulls ->li_lv from all log items and
> chains them to the CIL context for writing, so I don't see how we could
> have an item return a negative delta on an empty CIL. Hm?

In theory, yes. But like I said, I've made a bunch of assumptions
that this won't happen, and so without actually auditting the code
I'm not actually sure that it won't. i.e. I need to go check what
happens with items being marked stale, how shadow buffer
reallocation interacts with items that end up being smaller than the
existing buffer, etc. I've paged a lot of this detail out of my
brain, so until I spend the time to go over it all again I'm not
going to be able to answer definitively.

> (I was also wondering whether repeated smaller relogs of an item could
> be a vector for this to go wrong, but it looks like
> xlog_cil_insert_format_items() always uses the formatted size of the
> current buffer...).

That's my nagging doubt about all this - that there's an interaction
of this nature that I haven't considered due to the assumptions I've
made that allows underflow to occur. That would be much worse than
the current situation of hanging on a missing wakeup when the CIL is
full and used_space goes backwards....

Cheers,

Dave.

Dave Chinner Jan. 13, 2021, 9:53 p.m. UTC | #14

On Mon, Jan 11, 2021 at 11:38:48AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> > On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > > > If the value goes below the limit while some threads are
> > > > > > already waiting but before the push worker gets to it, these threads are
> > > > > > not woken.
> > > > > > 
> > > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > > > optimization. This is possible, because we hold the xc_push_lock
> > > > > > spinlock, which prevents additions to the waitqueue.
> > > > > > 
> > > > > > Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> > > > > > ---
> > > > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > > > --- a/fs/xfs/xfs_log_cil.c
> > > > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > > > >  	/*
> > > > > >  	 * Wake up any background push waiters now this context is being pushed.
> > > > > >  	 */
> > > > > > -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > > > +	if (waitqueue_active(&cil->xc_push_wait))
> > > > > >  		wake_up_all(&cil->xc_push_wait);
> > > > > 
> > > > > That just smells wrong to me. It *might* be correct, but this
> > > > > condition should pair with the sleep condition, as space used by a
> > > > > CIL context should never actually decrease....
> > > > > 
> > > > 
> > > > ... but I'm a little confused by this assertion. The shadow buffer
> > > > allocation code refers to the possibility of shadow buffers falling out
> > > > that are smaller than currently allocated buffers. Further, the
> > > > _insert_format_items() code appears to explicitly optimize for this
> > > > possibility by reusing the active buffer, subtracting the old size/count
> > > > values from the diff variables and then reformatting the latest
> > > > (presumably smaller) item to the lv.
> > > 
> > > Individual items might shrink, but the overall transaction should
> > > grow. Think of a extent to btree conversion of an inode fork. THe
> > > data in the inode fork decreases from a list of extents to a btree
> > > root block pointer, so the inode item shrinks. But then we add a new
> > > btree root block that contains all the extents + the btree block
> > > header, and it gets rounded up to ithe 128 byte buffer logging chunk
> > > size.
> > > 
> > > IOWs, while the inode item has decreased in size, the overall
> > > space consumed by the transaction has gone up and so the CIL ctx
> > > used_space should increase. Hence we can't just look at individual
> > > log items and whether they have decreased in size - we have to look
> > > at all the items in the transaction to understand how the space used
> > > in that transaction has changed. i.e. it's the aggregation of all
> > > items in the transaction that matter here, not so much the
> > > individual items.
> > > 
> > 
> > Ok, that makes more sense...
> > 
> > > > Of course this could just be implementation detail. I haven't dug into
> > > > the details in the remainder of this thread and I don't have specific
> > > > examples off the top of my head, but perhaps based on the ability of
> > > > various structures to change formats and the ability of log vectors to
> > > > shrink in size, shouldn't we expect the possibility of a CIL context to
> > > > shrink in size as well? Just from poking around the CIL it seems like
> > > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > > > for recalculating split res as well)...
> > > 
> > > Yes, there may be situations where it decreases. It may be this is
> > > fine, but the assumption *I've made* in lots of the CIL push code is
> > > that ctx->used_space rarely, if ever, will go backwards.
> > > 
> > 
> > ... and rarely seems a bit more pragmatic than never.
> > 
> 
> FWIW, a cursory look at the inode size/format code (motivated by
> Donald's recent log dump that appears to show inode log items changing
> size) suggested that a simple local format size change might be enough
> to cause this condition on an item. A subsequent test to create and
> immediately remove a file from an otherwise empty directory triggers a
> tracepoint I injected in xlog_cil_insert_items() to detect a negative
> transaction delta. As expected, the absolute value of the delta does
> seem to increase with a larger filename. This also produces a negative
> iovec delta, fwiw. E.g.:
> 
> # touch `for i in $(seq 0 63); do echo -n a; done`
> # rm -f `for i in $(seq 0 63); do echo -n a; done`
> #
> 
> rm-9265    [001] ....  4660.177806: xfs_log_commit_cil: 409: len -72 diff_iovecs 0
> rm-9265    [001] .N.1  4660.177913: xfs_log_commit_cil: 419: len -72 diff_iovecs 0
> rm-9265    [001] ....  4660.178313: xfs_log_commit_cil: 409: len -52 diff_iovecs -1
> rm-9265    [001] ...1  4660.178336: xfs_log_commit_cil: 419: len -64 diff_iovecs -1
> 
> ... and this only seems to occur when the unlink occurs before the CIL
> has been checkpointed and pushed out the inode (i.e. a freeze/unfreeze
> cycle prevents it).

Yeha, it's a shortform directory removal that triggers it easily
because the other items being modified in the transaction aren't
changing size on relogging (AGI unlink list pointer, unlinked inode
core for nlink change). Hence the reduction in size of the directory
inode reduces the overall CIL size...

> I've not dug into the transaction details and have no idea if this is
> the variant that Donald reproduces; it wouldn't surprise me a ton if
> there were various others. This is pretty straightforward, however, and
> shows the negative item delta carry through the transaction. IMO, that
> seems to justify a throttling fix...

I agree that a throttling fix is needed, but I'm trying to
understand the scope and breadth of the problem first instead of
jumping the gun and making the wrong fix for the wrong reasons that
just papers over the underlying problems that the throttling bug has
made us aware of...

Cheers,

Dave.

Donald Buczek Feb. 15, 2021, 1:36 p.m. UTC | #15

On 13.01.21 22:53, Dave Chinner wrote:
> [...]
> I agree that a throttling fix is needed, but I'm trying to
> understand the scope and breadth of the problem first instead of
> jumping the gun and making the wrong fix for the wrong reasons that
> just papers over the underlying problems that the throttling bug has
> made us aware of...

Are you still working on this?

If it takes more time to understand the potential underlying problem, the fix for the problem at hand should be applied.

This is a real world problem, accidentally found in the wild. It appears very rarely, but it freezes a filesystem or the whole system. It exists in 5.7 , 5.8 , 5.9 , 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix use-after-free on CIL context on shutdown") which silently added a condition to the wakeup. The condition is based on a wrong assumption.

Why is this "papering over"? If a reminder was needed, there were better ways than randomly hanging the system.

Why is

     if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
         wake_up_all(&cil->xc_push_wait);

, which doesn't work reliably, preferable to

     if (waitqueue_active(&cil->xc_push_wait))
         wake_up_all(&cil->xc_push_wait);

which does?

Best
   Donald

> Cheers,
> 
> Dave

Brian Foster Feb. 16, 2021, 11:18 a.m. UTC | #16

On Mon, Feb 15, 2021 at 02:36:38PM +0100, Donald Buczek wrote:
> On 13.01.21 22:53, Dave Chinner wrote:
> > [...]
> > I agree that a throttling fix is needed, but I'm trying to
> > understand the scope and breadth of the problem first instead of
> > jumping the gun and making the wrong fix for the wrong reasons that
> > just papers over the underlying problems that the throttling bug has
> > made us aware of...
> 
> Are you still working on this?
> 
> If it takes more time to understand the potential underlying problem, the fix for the problem at hand should be applied.
> 
> This is a real world problem, accidentally found in the wild. It appears very rarely, but it freezes a filesystem or the whole system. It exists in 5.7 , 5.8 , 5.9 , 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix use-after-free on CIL context on shutdown") which silently added a condition to the wakeup. The condition is based on a wrong assumption.
> 
> Why is this "papering over"? If a reminder was needed, there were better ways than randomly hanging the system.
> 
> Why is
> 
>     if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
>         wake_up_all(&cil->xc_push_wait);
> 
> , which doesn't work reliably, preferable to
> 
>     if (waitqueue_active(&cil->xc_push_wait))
>         wake_up_all(&cil->xc_push_wait);
> 
> which does?
> 

JFYI, Dave followed up with a patch a couple weeks or so ago:

https://lore.kernel.org/linux-xfs/20210128044154.806715-5-david@fromorbit.com/

Brian

> Best
>   Donald
> 
> > Cheers,
> > 
> > Dave
>

Donald Buczek Feb. 16, 2021, 12:40 p.m. UTC | #17

On 16.02.21 12:18, Brian Foster wrote:
> On Mon, Feb 15, 2021 at 02:36:38PM +0100, Donald Buczek wrote:
>> On 13.01.21 22:53, Dave Chinner wrote:
>>> [...]
>>> I agree that a throttling fix is needed, but I'm trying to
>>> understand the scope and breadth of the problem first instead of
>>> jumping the gun and making the wrong fix for the wrong reasons that
>>> just papers over the underlying problems that the throttling bug has
>>> made us aware of...
>>
>> Are you still working on this?
>>
>> If it takes more time to understand the potential underlying problem, the fix for the problem at hand should be applied.
>>
>> This is a real world problem, accidentally found in the wild. It appears very rarely, but it freezes a filesystem or the whole system. It exists in 5.7 , 5.8 , 5.9 , 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix use-after-free on CIL context on shutdown") which silently added a condition to the wakeup. The condition is based on a wrong assumption.
>>
>> Why is this "papering over"? If a reminder was needed, there were better ways than randomly hanging the system.
>>
>> Why is
>>
>>      if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
>>          wake_up_all(&cil->xc_push_wait);
>>
>> , which doesn't work reliably, preferable to
>>
>>      if (waitqueue_active(&cil->xc_push_wait))
>>          wake_up_all(&cil->xc_push_wait);
>>
>> which does?
>>
> 
> JFYI, Dave followed up with a patch a couple weeks or so ago:
> 
> https://lore.kernel.org/linux-xfs/20210128044154.806715-5-david@fromorbit.com/

Oh, great. I apologize for the unneeded reminder.

Best

   Donald

> 
> Brian
> 
>> Best
>>    Donald
>>
>>> Cheers,
>>>
>>> Dave
>>
>

xfs: Wake CIL push waiters more reliably

Commit Message

Comments

Patch