
[1/4] xfs: force all buffers to be written during btree bulk load

Message ID 170086926593.2770816.5504104328549141972.stgit@frogsfrogsfrogs (mailing list archive)
State Superseded
Series xfs: prepare repair for bulk loading

Commit Message

Darrick J. Wong Nov. 24, 2023, 11:48 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

While stress-testing online repair of btrees, I noticed periodic
assertion failures from the buffer cache about buffer readers
encountering buffers with DELWRI_Q set, even though the btree bulk load
had already committed and the buffer itself wasn't on any delwri list.

I traced this to a misunderstanding of how the delwri lists work,
particularly with regards to the AIL's buffer list.  If a buffer is
logged and committed, the buffer can end up on that AIL buffer list.  If
btree repairs are run twice in rapid succession, it's possible that the
first repair will invalidate the buffer and free it before the next time
the AIL wakes up.  This clears DELWRI_Q from the buffer state.

If the second repair allocates the same block, it will then recycle the
buffer to start writing the new btree block.  Meanwhile, if the AIL
wakes up and walks the buffer list, it will ignore the buffer because it
can't lock it, and go back to sleep.

When the second repair calls delwri_queue to put the buffer on the
list of buffers to write before committing the new btree, it will set
DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
buffer list, it won't add it to the bulkload's buffer list.

This is incorrect, because the bulkload caller relies on delwri_submit
to ensure that all the buffers have been sent to disk /before/
committing the new btree root pointer.  This ordering is required for
data consistency.

Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
drop it, so the next thread to walk through the btree will trip over a
debug assertion on that flag.

To fix this, create a new function that waits for the buffer to be
removed from any other delwri lists before adding the buffer to the
caller's delwri list.  By waiting for the buffer to clear both the
delwri list and any potential delwri wait list, we can be sure that
repair will initiate writes of all buffers and report all write errors
back to userspace instead of committing the new structure.
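
For illustration, the calling pattern this guarantees looks roughly like
the following sketch.  Everything here except xfs_buf_delwri_submit() is
a hypothetical placeholder, not the actual repair code:

	int
	xrep_bulkload_commit_sketch(
		struct list_head	*buffers_list)
	{
		int			error;

		/*
		 * Write all of the new btree blocks and collect any IO
		 * errors here, while the repair can still be cancelled...
		 */
		error = xfs_buf_delwri_submit(buffers_list);
		if (error)
			return error;

		/* ...and only then commit the new btree root pointer. */
		return xrep_commit_new_root_sketch();
	}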

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree_staging.c |    4 +--
 fs/xfs/xfs_buf.c                  |   47 ++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_buf.h                  |    1 +
 3 files changed, 45 insertions(+), 7 deletions(-)

Comments

Christoph Hellwig Nov. 25, 2023, 5:49 a.m. UTC | #1
The code makes me feel we're fixing the wrong thing, but I have to
admit I don't fully understand it. Let me go step by step.

> While stress-testing online repair of btrees, I noticed periodic
> assertion failures from the buffer cache about buffer readers
> encountering buffers with DELWRI_Q set, even though the btree bulk load
> had already committed and the buffer itself wasn't on any delwri list.

Which assert do these buffer readers hit?  The two I can spot that are in
the read path in the broader sense are:

  1) in xfs_buf_find_lock for stale buffers.
  2) in __xfs_buf_submit just before I/O submission
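
(From memory, both reduce to the same check, something like
ASSERT(!(bp->b_flags & _XBF_DELWRI_Q)).)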

> I traced this to a misunderstanding of how the delwri lists work,
> particularly with regards to the AIL's buffer list.  If a buffer is
> logged and committed, the buffer can end up on that AIL buffer list.  If
> btree repairs are run twice in rapid succession, it's possible that the
> first repair will invalidate the buffer and free it before the next time
> the AIL wakes up.  This clears DELWRI_Q from the buffer state.

Where "this clears" is xfs_buf_stale called from xfs_btree_free_block
via xfs_trans_binval?

> If the second repair allocates the same block, it will then recycle the
> buffer to start writing the new btree block.

If my above theory is correct: how do we end up reusing a stale buffer?
If not, what am I misunderstanding above?

> Meanwhile, if the AIL
> wakes up and walks the buffer list, it will ignore the buffer because it
> can't lock it, and go back to sleep.

And I think this is where the trouble starts - we have a buffer that
is left on some delwri list, but with the _XBF_DELWRI_Q flag cleared,
it is stale and we then reuse it.  I think we need to kick it off the
delwri list not just for btree staging, but in general.

> 
> When the second repair calls delwri_queue to put the buffer on the
> list of buffers to write before committing the new btree, it will set
> DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
> buffer list, it won't add it to the bulkload's buffer list.
>
> This is incorrect, because the bulkload caller relies on delwri_submit
> to ensure that all the buffers have been sent to disk /before/
> committing the new btree root pointer.  This ordering is required for
> data consistency.
>
> Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
> drop it, so the next thread to walk through the btree will trip over a
> debug assertion on that flag.

Where does it finally drop it?

> To fix this, create a new function that waits for the buffer to be
> removed from any other delwri lists before adding the buffer to the
> caller's delwri list.  By waiting for the buffer to clear both the
> delwri list and any potential delwri wait list, we can be sure that
> repair will initiate writes of all buffers and report all write errors
> back to userspace instead of committing the new structure.

If my understanding above is correct this just papers over the bug
that a buffer that is marked stale and can be reused for something
else is left on a delwri list.  I haven't entirely thought through all
the consequences, but here is what I'd try:

 - if xfs_buf_find_lock finds a stale buffer with _XBF_DELWRI_Q
   call your new wait code instead of asserting (probably only
   for the !trylock case)
 - make sure we don't leak DELWRI_Q
Darrick J. Wong Nov. 28, 2023, 1:50 a.m. UTC | #2
On Fri, Nov 24, 2023 at 09:49:28PM -0800, Christoph Hellwig wrote:
> The code makes me feel we're fixing the wrong thing, but I have to
> admit I don't fully understand it. Let me go step by step.
> 
> > While stress-testing online repair of btrees, I noticed periodic
> > assertion failures from the buffer cache about buffer readers
> > encountering buffers with DELWRI_Q set, even though the btree bulk load
> > had already committed and the buffer itself wasn't on any delwri list.
> 
> Which assert do these buffer readers hit?  The two I can spot that are in
> the read path in the broader sense are:
> 
>   1) in xfs_buf_find_lock for stale buffers.
>   2) in __xfs_buf_submit just before I/O submission

The second assert:

AIL:	Repair0:	Repair1:

pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list

	stale buf X:
	clear DELWRI_Q
	does not clear b_list
	free space X
	commit

delwri_submit	# oops

Though there's more to this.

> > I traced this to a misunderstanding of how the delwri lists work,
> > particularly with regards to the AIL's buffer list.  If a buffer is
> > logged and committed, the buffer can end up on that AIL buffer list.  If
> > btree repairs are run twice in rapid succession, it's possible that the
> > first repair will invalidate the buffer and free it before the next time
> > the AIL wakes up.  This clears DELWRI_Q from the buffer state.
> 
> Where "this clears" is xfs_buf_stale called from xfs_btree_free_block
> via xfs_trans_binval?

Yes.  I'll rework the last sentence to read: "Marking the buffer stale
clears DELWRI_Q from the buffer state without removing the buffer from
its delwri list."

> > If the second repair allocates the same block, it will then recycle the
> > buffer to start writing the new btree block.
> 
> If my above theory is correct: how do we end up reusing a stale buffer?
> If not, what am I misunderstanding above?

If we free a metadata buffer, we'll mark it stale, update the bnobt, and
add an entry to the extent busy list.  If a later repair finds:

1. The same free space in the bnobt;
2. The free space exactly coincides with one extent busy list entry;
3. The entry isn't in the middle of discarding the block;

Then the allocator will remove the extent busy list entry and let us
have the space.  At that point we could have a stale buffer that's also
on one of the AIL's lists:

AIL:	Repair0:	Repair1:

pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list

	stale buf X:
	clear DELWRI_Q
	does not clear b_list
	free space X
	commit

			find free space X
			get buffer
			rewrite buffer
			delwri_queue:
			set DELWRI_Q
			already on a list, do not add
			commit

			BAD: committed tree root before all blocks written

delwri_submit	# too late now

I could demonstrate this by injecting dmerror while delwri(ting) the new
btree blocks, and observing that some of the EIOs would not trigger the
rollback but instead would trip IO errors in the AIL.

> > wakes up and walks the buffer list, it will ignore the buffer because it
> > can't lock it, and go back to sleep.
> 
> And I think this is where the trouble starts - we have a buffer that
> is left on some delwri list, but with the _XBF_DELWRI_Q flag cleared,
> it is stale and we then reuse it.  I think we need to kick it off the
> delwri list not just for btree staging, but in general.

...but how to do that?  We don't know whose delwri list the buffer's on,
let alone how to lock the list so that we can remove the buffer from
that list.
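
The structural reason, as a sketch: the only delwri linkage a buffer
carries is the bare list node in struct xfs_buf, so nothing records
which list head it hangs off or what lock serializes that list:

	struct xfs_buf {
		/* ... */
		struct list_head	b_list;	/* delwri linkage only */
		/* ... */
	};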

(Oh, you have some suggestions below.)

> > When the second repair calls delwri_queue to put the buffer on the
> > list of buffers to write before committing the new btree, it will set
> > DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
> > buffer list, it won't add it to the bulkload's buffer list.
> >
> > This is incorrect, because the bulkload caller relies on delwri_submit
> > to ensure that all the buffers have been sent to disk /before/
> > committing the new btree root pointer.  This ordering is required for
> > data consistency.
> >
> > Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
> > drop it, so the next thread to walk through the btree will trip over a
> > debug assertion on that flag.
> 
> Where does it finally drop it?

TBH, it's been so long that I can't remember anymore. :(

> > To fix this, create a new function that waits for the buffer to be
> > removed from any other delwri lists before adding the buffer to the
> > caller's delwri list.  By waiting for the buffer to clear both the
> > delwri list and any potential delwri wait list, we can be sure that
> > repair will initiate writes of all buffers and report all write errors
> > back to userspace instead of committing the new structure.
> 
> If my understanding above is correct this just papers over the bug
> that a buffer that is marked stale and can be reused for something
> else is left on a delwri list.

I'm not sure it's a bug for *most* code that encounters it.  Regular
transactions don't directly use the delwri lists, so they will never see
that at all.  If the buffer gets rewritten and logged, then it'll just
end up on the AIL's delwri buffer list again.

The other delwri_submit users don't seem to care if the buffer gets
written directly by delwri_submit or by the AIL.  In the first case, the
caller will get the EIO and force a shutdown; in the second case, the
AIL will shut down the log.

Weirdly, xfs_bwrite clears DELWRI_Q but doesn't necessarily shut down the
fs if the write fails.

Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does
it mean "b_list is active"?  Or does it merely mean "another thread will
write this buffer via b_list if nobody gets there first"?  I think it's
the second, since it's easy enough to list_empty().

It's only repair that requires this new behavior that all new buffers
have to be written through the delwri list, and only because it actually
/can/ cancel the transaction by finishing the autoreap EFIs.

> I haven't entirely thought through all the
> consequences, but here is what I'd try:
> 
>  - if xfs_buf_find_lock finds a stale buffer with _XBF_DELWRI_Q
>    call your new wait code instead of asserting (probably only
>    for the !trylock case)
>  - make sure we don't leak DELWRI_Q 

Way back when I first discovered this, my first impulse was to make
xfs_buf_stale wait for DELWRI_Q to clear.  That IIRC didn't work because
a transaction could hold an inode that the AIL will need to lock.  I
think xfs_buf_find_lock would have the same problem.

Seeing as repair is the only user with the requirement that it alone can
issue writes for the delwri list buffers, that's why I went with what's
in this patch.

Thank you for persevering so far! :D

--D
Christoph Hellwig Nov. 28, 2023, 7:13 a.m. UTC | #3
> > If my understanding above is correct this just papers over the bug
> > that a buffer that is marked stale and can be reused for something
> > else is left on a delwri list.
> 
> I'm not sure it's a bug for *most* code that encounters it.  Regular
> transactions don't directly use the delwri lists, so they will never see
> that at all. If the buffer gets rewritten and logged, then it'll just
> end up on the AIL's delwri buffer list again.

Where "regular transactions" means data I/O, yes.  All normal metadata
I/O eventually ends up on a delwri list.  And we want it on that delwri
list after actually updating the contents.

> Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does
> it mean "b_list is active"?  Or does it merely mean "another thread will
> write this buffer via b_list if nobody gets there first"?  I think it's
> the second, since it's easy enough to list_empty().

I think it is misnamed and not clearly defined, and we should probably
fix that.  Note that at least the current list_empty() usage isn't quite
correct either without a lock held by the delwri list owner.  We at
least need list_empty_careful() for a racy but still safe check.
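
For illustration, the shape of that check (a sketch, not proposed code;
list_empty_careful() is the stock helper from <linux/list.h>):

	/*
	 * Unlocked peek at whether a buffer sits on some delwri list.
	 * list_empty_careful() tests both ->next and ->prev, so it stays
	 * sane against a racing list_del_init(), which is the only
	 * mutation expected here; a plain list_empty() would not.
	 */
	static inline bool
	xfs_buf_on_delwri_list_sketch(struct xfs_buf *bp)
	{
		return !list_empty_careful(&bp->b_list);
	}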

Thinking out loud I wonder if we can just kill XBF_DELWRI_Q entirely.
Except for various asserts we mostly use it in reverse polarity, that is
to cancel delwri writeout for stale buffers.  How about just skipping
XBF_STALE buffers on the delwri list directly and not bother with
DELWRI_Q?  With that we can then support multiple delwri lists that
don't get into each other's way using, say, an allocating xarray instead
of the invasive list and avoid this whole mess.

Let me try to prototype this..
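
The shape of the idea (a hypothetical sketch of the submit-side loop,
not the actual prototype; locking and reference counting elided):

	struct xfs_buf		*bp, *n;

	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		/* Skip stale buffers here instead of depending on
		 * _XBF_DELWRI_Q having been cleared at stale time. */
		if (bp->b_flags & XBF_STALE) {
			list_del_init(&bp->b_list);
			continue;
		}
		/* ... lock and __xfs_buf_submit() as before ... */
	}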

> It's only repair that requires this new behavior that all new buffers
> have to be written through the delwri list, and only because it actually
> /can/ cancel the transaction by finishing the autoreap EFIs.

Right now yes.  But I think the delwri behavior is a land mine, and
this just happens to be the first user to trigger it.  Edit: while
looking through the DELWRI_Q usage I noticed xfs_qm_flush_one, which
seems to deal with what is at least a somewhat related issue based
on the comments there.

> Way back when I first discovered this, my first impulse was to make
> xfs_buf_stale wait for DELWRI_Q to clear.  That IIRC didn't work because
> a transaction could hold an inode that the AIL will need to lock.  I
> think xfs_buf_find_lock would have the same problem.

Yes, that makes sense.  Would be great to document such details in the
commit message..
Christoph Hellwig Nov. 28, 2023, 3:18 p.m. UTC | #4
On Mon, Nov 27, 2023 at 11:13:40PM -0800, Christoph Hellwig wrote:
> > Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does
> > it mean "b_list is active"?  Or does it merely mean "another thread will
> > write this buffer via b_list if nobody gets there first"?  I think it's
> > the second, since it's easy enough to list_empty().
> 
> I think it is misnamed and not clearly defined, and we should probably
> fix that.  Note that at least the current list_empty() usage isn't quite
> correct either without a lock held by the delwri list owner.  We at
> least need list_empty_careful() for a racy but still safe check.
> 
> Thinking out loud I wonder if we can just kill XBF_DELWRI_Q entirely.
> Except for various asserts we mostly use it in reverse polarity, that is
> to cancel delwri writeout for stale buffers.  How about just skipping
> XBF_STALE buffers on the delwri list directly and not bother with
> DELWRI_Q?  With that we can then support multiple delwri lists that
> don't get into each other's way using, say, an allocating xarray instead
> of the invasive list and avoid this whole mess.
> 
> Let me try to prototype this..

Ok, I spent half the day prototyping this and it passes basic sanity
checks:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/xfs-multiple-delwri-lists

My conclusions from that is:

 1) it works
 2) I think it is the right thing to do
 3) it needs a lot more work
 4) we can't block the online repair work on it

so I guess we'll need to go with the approach in this patch for now,
maybe with a better commit log, and I'll look into finishing this work
some time in the future.
Darrick J. Wong Nov. 28, 2023, 5:07 p.m. UTC | #5
On Tue, Nov 28, 2023 at 07:18:22AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 27, 2023 at 11:13:40PM -0800, Christoph Hellwig wrote:
> > > Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does
> > > it mean "b_list is active"?  Or does it merely mean "another thread will
> > > write this buffer via b_list if nobody gets there first"?  I think it's
> > > the second, since it's easy enough to list_empty().
> > 
> > I think it is misnamed and not clearly defined, and we should probably
> > fix that.  Note that at least the current list_empty() usage isn't quite
> > correct either without a lock held by the delwri list owner.  We at
> > least need list_empty_careful() for a racy but still safe check.
> > 
> > Thinking out loud I wonder if we can just kill XBF_DELWRI_Q entirely.
> > Except for various asserts we mostly use it in reverse polarity, that is
> > to cancel delwri writeout for stale buffers.  How about just skipping
> > XBF_STALE buffers on the delwri list directly and not bother with
> > DELWRI_Q?  With that we can then support multiple delwri lists that
> > don't get into each other's way using, say, an allocating xarray instead
> > of the invasive list and avoid this whole mess.
> > 
> > Let me try to prototype this..
> 
> Ok, I spent half the day prototyping this and it passes basic sanity
> checks:
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/xfs-multiple-delwri-lists
> 
> My conclusions from that is:
> 
>  1) it works
>  2) I think it is the right thing to do
>  3) it needs a lot more work
>  4) we can't block the online repair work on it
> 
> so I guess we'll need to go with the approach in this patch for now,
> maybe with a better commit log, and I'll look into finishing this work
> some time in the future.

<nod> I think an xarray version of this would be less clunky than
xfs_delwri_buf with three pointers.

Also note Dave's comments on the v25 posting of this series:
https://lore.kernel.org/linux-xfs/ZJJa7Cnni0mb%2F9sN@dread.disaster.area/

--D
Christoph Hellwig Nov. 30, 2023, 4:33 a.m. UTC | #6
On Tue, Nov 28, 2023 at 09:07:34AM -0800, Darrick J. Wong wrote:
> > so I guess we'll need to go with the approach in this patch for now,
> > maybe with a better commit log, and I'll look into finishing this work
> > some time in the future.
> 
> <nod> I think an xarray version of this would be less clunky than
> xfs_delwri_buf with three pointers.

It would, but xarray indices are unsigned longs, so we couldn't simply
use the block number for it.  So unless we decided to give up on
building XFS on 32-bit kernels (which would be nice for other reasons),
it would need something more complicated.
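
Concretely (a sketch of the mismatch; sketch_track_buf() is a
hypothetical name):

	#include <linux/xarray.h>

	/*
	 * xarray indices are unsigned long -- 32 bits on a 32-bit kernel --
	 * while XFS disk addresses (xfs_daddr_t) are always 64 bits, so the
	 * cast below would truncate large block numbers.
	 */
	static int
	sketch_track_buf(struct xarray *xa, xfs_daddr_t daddr,
			struct xfs_buf *bp)
	{
		return xa_err(xa_store(xa, (unsigned long)daddr, bp,
				GFP_KERNEL));
	}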

Patch

diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index dd75e208b543e..29e3f8ccb1852 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -342,9 +342,7 @@  xfs_btree_bload_drop_buf(
 	if (*bpp == NULL)
 		return;
 
-	if (!xfs_buf_delwri_queue(*bpp, buffers_list))
-		ASSERT(0);
-
+	xfs_buf_delwri_queue_here(*bpp, buffers_list);
 	xfs_buf_relse(*bpp);
 	*bpp = NULL;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 545c7991b9b58..f88b1334b420c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2049,6 +2049,14 @@  xfs_alloc_buftarg(
 	return NULL;
 }
 
+static inline void
+xfs_buf_list_del(
+	struct xfs_buf		*bp)
+{
+	list_del_init(&bp->b_list);
+	wake_up_var(&bp->b_list);
+}
+
 /*
  * Cancel a delayed write list.
  *
@@ -2066,7 +2074,7 @@  xfs_buf_delwri_cancel(
 
 		xfs_buf_lock(bp);
 		bp->b_flags &= ~_XBF_DELWRI_Q;
-		list_del_init(&bp->b_list);
+		xfs_buf_list_del(bp);
 		xfs_buf_relse(bp);
 	}
 }
@@ -2119,6 +2127,37 @@  xfs_buf_delwri_queue(
 	return true;
 }
 
+/*
+ * Queue a buffer to this delwri list as part of a data integrity operation.
+ * If the buffer is on any other delwri list, we'll wait for that to clear
+ * so that the caller can submit the buffer for IO and wait for the result.
+ * Callers must ensure the buffer is not already on the list.
+ */
+void
+xfs_buf_delwri_queue_here(
+	struct xfs_buf		*bp,
+	struct list_head	*buffer_list)
+{
+	/*
+	 * We need this buffer to end up on the /caller's/ delwri list, not any
+	 * old list.  This can happen if the buffer is marked stale (which
+	 * clears DELWRI_Q) after the AIL queues the buffer to its list but
+	 * before the AIL has a chance to submit the list.
+	 */
+	while (!list_empty(&bp->b_list)) {
+		xfs_buf_unlock(bp);
+		wait_var_event(&bp->b_list, list_empty(&bp->b_list));
+		xfs_buf_lock(bp);
+	}
+
+	ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
+
+	/* This buffer is uptodate; don't let it get reread. */
+	bp->b_flags |= XBF_DONE;
+
+	xfs_buf_delwri_queue(bp, buffer_list);
+}
+
 /*
  * Compare function is more complex than it needs to be because
  * the return value is only 32 bits and we are doing comparisons
@@ -2181,7 +2220,7 @@  xfs_buf_delwri_submit_buffers(
 		 * reference and remove it from the list here.
 		 */
 		if (!(bp->b_flags & _XBF_DELWRI_Q)) {
-			list_del_init(&bp->b_list);
+			xfs_buf_list_del(bp);
 			xfs_buf_relse(bp);
 			continue;
 		}
@@ -2201,7 +2240,7 @@  xfs_buf_delwri_submit_buffers(
 			list_move_tail(&bp->b_list, wait_list);
 		} else {
 			bp->b_flags |= XBF_ASYNC;
-			list_del_init(&bp->b_list);
+			xfs_buf_list_del(bp);
 		}
 		__xfs_buf_submit(bp, false);
 	}
@@ -2255,7 +2294,7 @@  xfs_buf_delwri_submit(
 	while (!list_empty(&wait_list)) {
 		bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
 
-		list_del_init(&bp->b_list);
+		xfs_buf_list_del(bp);
 
 		/*
 		 * Wait on the locked buffer, check for errors and unlock and
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index c86e164196568..b470de08a46ca 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -319,6 +319,7 @@  extern void xfs_buf_stale(struct xfs_buf *bp);
 /* Delayed Write Buffer Routines */
 extern void xfs_buf_delwri_cancel(struct list_head *);
 extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
+void xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *bl);
 extern int xfs_buf_delwri_submit(struct list_head *);
 extern int xfs_buf_delwri_submit_nowait(struct list_head *);
 extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);