| Message ID | 170086926593.2770816.5504104328549141972.stgit@frogsfrogsfrogs (mailing list archive) |
|---|---|
| State | Superseded |
| Series | xfs: prepare repair for bulk loading |
The code makes me feel we're fixing the wrong thing, but I have to
admit I don't fully understand it.  Let me go step by step.

> While stress-testing online repair of btrees, I noticed periodic
> assertion failures from the buffer cache about buffer readers
> encountering buffers with DELWRI_Q set, even though the btree bulk load
> had already committed and the buffer itself wasn't on any delwri list.

Which assert do these buffer readers hit?  The two I can spot that are
in the read path in the broader sense are:

 1) in xfs_buf_find_lock for stale buffers.
 2) in __xfs_buf_submit just before I/O submission

> I traced this to a misunderstanding of how the delwri lists work,
> particularly with regards to the AIL's buffer list.  If a buffer is
> logged and committed, the buffer can end up on that AIL buffer list.  If
> btree repairs are run twice in rapid succession, it's possible that the
> first repair will invalidate the buffer and free it before the next time
> the AIL wakes up.  This clears DELWRI_Q from the buffer state.

Where "this clears" is xfs_buf_stale called from xfs_btree_free_block
via xfs_trans_binval?

> If the second repair allocates the same block, it will then recycle the
> buffer to start writing the new btree block.

If my above theory is correct: how do we end up reusing a stale buffer?
If not, what am I misunderstanding above?

> Meanwhile, if the AIL
> wakes up and walks the buffer list, it will ignore the buffer because it
> can't lock it, and go back to sleep.

And I think this is where the trouble starts - we have a buffer that is
left on some delwri list, but with the _XBF_DELWRI_Q flag cleared; it is
stale and we then reuse it.  I think we need to kick it off the delwri
list not just for btree staging, but in general.
> When the second repair calls delwri_queue to put the buffer on the
> list of buffers to write before committing the new btree, it will set
> DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
> buffer list, it won't add it to the bulkload buffer's list.
>
> This is incorrect, because the bulkload caller relies on delwri_submit
> to ensure that all the buffers have been sent to disk /before/
> committing the new btree root pointer.  This ordering requirement is
> required for data consistency.
>
> Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
> drop it, so the next thread to walk through the btree will trip over a
> debug assertion on that flag.

Where does it finally drop it?

> To fix this, create a new function that waits for the buffer to be
> removed from any other delwri lists before adding the buffer to the
> caller's delwri list.  By waiting for the buffer to clear both the
> delwri list and any potential delwri wait list, we can be sure that
> repair will initiate writes of all buffers and report all write errors
> back to userspace instead of committing the new structure.

If my understanding above is correct this just papers over the bug that
a buffer that is marked stale and can be reused for something else is
left on a delwri list.  I've not entirely thought about all the
consequences, but here is what I'd try:

 - if xfs_buf_find_lock finds a stale buffer with _XBF_DELWRI_Q,
   call your new wait code instead of asserting (probably only
   for the !trylock case)
 - make sure we don't leak DELWRI_Q
On Fri, Nov 24, 2023 at 09:49:28PM -0800, Christoph Hellwig wrote:
> The code makes me feel we're fixing the wrong thing, but I have to
> admit I don't fully understand it.  Let me go step by step.
>
> > While stress-testing online repair of btrees, I noticed periodic
> > assertion failures from the buffer cache about buffer readers
> > encountering buffers with DELWRI_Q set, even though the btree bulk load
> > had already committed and the buffer itself wasn't on any delwri list.
>
> Which assert do these buffer readers hit?  The two I can spot that are
> in the read path in the broader sense are:
>
>  1) in xfs_buf_find_lock for stale buffers.
>  2) in __xfs_buf_submit just before I/O submission

The second assert:

AIL:                    Repair0:                Repair1:

pin buffer X
delwri_queue:
  set DELWRI_Q
  add to delwri list

                        stale buf X:
                          clear DELWRI_Q
                          does not clear b_list

                        free space X
                        commit

delwri_submit   # oops

Though there's more to this.

> > I traced this to a misunderstanding of how the delwri lists work,
> > particularly with regards to the AIL's buffer list.  If a buffer is
> > logged and committed, the buffer can end up on that AIL buffer list.  If
> > btree repairs are run twice in rapid succession, it's possible that the
> > first repair will invalidate the buffer and free it before the next time
> > the AIL wakes up.  This clears DELWRI_Q from the buffer state.
>
> Where "this clears" is xfs_buf_stale called from xfs_btree_free_block
> via xfs_trans_binval?

Yes.  I'll rework the last sentence to read: "Marking the buffer stale
clears DELWRI_Q from the buffer state without removing the buffer from
its delwri list."

> > If the second repair allocates the same block, it will then recycle the
> > buffer to start writing the new btree block.
>
> If my above theory is correct: how do we end up reusing a stale buffer?
> If not, what am I misunderstanding above?

If we free a metadata buffer, we'll mark it stale, update the bnobt, and
add an entry to the extent busy list.  If a later repair finds:

1. The same free space in the bnobt;
2. The free space exactly coincides with one extent busy list entry;
3. The entry isn't in the middle of discarding the block;

Then the allocator will remove the extent busy list entry and let us
have the space.  At that point we could have a stale buffer that's also
on one of the AIL's lists:

AIL:                    Repair0:                Repair1:

pin buffer X
delwri_queue:
  set DELWRI_Q
  add to delwri list

                        stale buf X:
                          clear DELWRI_Q
                          does not clear b_list

                        free space X
                        commit

                                                find free space X
                                                get buffer
                                                rewrite buffer
                                                delwri_queue:
                                                  set DELWRI_Q
                                                  already on a list, do not add
                                                commit

                                                BAD: committed tree root before
                                                all blocks written

                                                delwri_submit   # too late now

I could demonstrate this by injecting dmerror while delwri(ting) the new
btree blocks, and observing that some of the EIOs would not trigger the
rollback but instead would trip IO errors in the AIL.

> > wakes up and walks the buffer list, it will ignore the buffer because it
> > can't lock it, and go back to sleep.
>
> And I think this is where the trouble starts - we have a buffer that
> is left on some delwri list, but with the _XBF_DELWRI_Q flag cleared,
> it is stale and we then reuse it.  I don't think we just need to kick
> it off the delwri list just for btree staging, but in general.

...but how to do that?  We don't know whose delwri list the buffer's on,
let alone how to lock the list so that we can remove the buffer from
that list.  (Oh, you have some suggestions below.)

> > When the second repair calls delwri_queue to put the buffer on the
> > list of buffers to write before committing the new btree, it will set
> > DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
> > buffer list, it won't add it to the bulkload buffer's list.
> >
> > This is incorrect, because the bulkload caller relies on delwri_submit
> > to ensure that all the buffers have been sent to disk /before/
> > committing the new btree root pointer.  This ordering requirement is
> > required for data consistency.
> > Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
> > drop it, so the next thread to walk through the btree will trip over a
> > debug assertion on that flag.
>
> Where does it finally drop it?

TBH, it's been so long that I can't remember anymore. :(

> > To fix this, create a new function that waits for the buffer to be
> > removed from any other delwri lists before adding the buffer to the
> > caller's delwri list.  By waiting for the buffer to clear both the
> > delwri list and any potential delwri wait list, we can be sure that
> > repair will initiate writes of all buffers and report all write errors
> > back to userspace instead of committing the new structure.
>
> If my understanding above is correct this just papers over the bug
> that a buffer that is marked stale and can be reused for something
> else is left on a delwri list.

I'm not sure it's a bug for *most* code that encounters it.  Regular
transactions don't directly use the delwri lists, so they will never see
that at all.  If the buffer gets rewritten and logged, then it'll just
end up on the AIL's delwri buffer list again.

The other delwri_submit users don't seem to care if the buffer gets
written directly by delwri_submit or by the AIL.  In the first case, the
caller will get the EIO and force a shutdown; in the second case, the
AIL will shut down the log.  Weirdly, xfs_bwrite clears DELWRI_Q but
doesn't necessarily shut down the fs if the write fails.

Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does
it mean "b_list is active"?  Or does it merely mean "another thread will
write this buffer via b_list if nobody gets there first"?  I think it's
the second, since it's easy enough to list_empty().

It's only repair that requires this new behavior that all new buffers
have to be written through the delwri list, and only because it actually
/can/ cancel the transaction by finishing the autoreap EFIs.
> I've not entirely thought about all the consequences, but here is what
> I'd try:
>
>  - if xfs_buf_find_lock finds a stale buffer with _XBF_DELWRI_Q
>    call your new wait code instead of asserting (probably only
>    for the !trylock case)
>  - make sure we don't leak DELWRI_Q

Way back when I first discovered this, my first impulse was to make
xfs_buf_stale wait for DELWRI_Q to clear.  That IIRC didn't work because
a transaction could hold an inode that the AIL will need to lock.  I
think xfs_buf_find_lock would have the same problem.

Seeing as repair is the only user with the requirement that it alone can
issue writes for the delwri list buffers, that's why I went with what's
in this patch.

Thank you for persevering so far! :D

--D
> > If my understanding above is correct this just papers over the bug
> > that a buffer that is marked stale and can be reused for something
> > else is left on a delwri list.
>
> I'm not sure it's a bug for *most* code that encounters it.  Regular
> transactions don't directly use the delwri lists, so they will never see
> that at all.  If the buffer gets rewritten and logged, then it'll just
> end up on the AIL's delwri buffer list again.

Where "regular transactions" means data I/O, yes.  All normal metadata
I/O eventually ends up on a delwri list.  And we want it on that delwri
list after actually updating the contents.

> Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does
> it mean "b_list is active"?  Or does it merely mean "another thread will
> write this buffer via b_list if nobody gets there first"?  I think it's
> the second, since it's easy enough to list_empty().

I think it is misnamed and not clearly defined, and we should probably
fix that.  Note that at least the current list_empty() usage isn't quite
correct either without a lock held by the delwri list owner.  We at
least need a list_empty_careful for a racy but still safe check.

Thinking out loud I wonder if we can just kill XBF_DELWRI_Q entirely.
Except for various asserts we mostly use it in reverse polarity, that is
to cancel delwri writeout for stale buffers.  How about just skipping
XBF_STALE buffers on the delwri list directly and not bothering with
DELWRI_Q?  With that we can then support multiple delwri lists that
don't get into each other's way using, say, an allocating xarray instead
of the invasive list, and avoid this whole mess.

Let me try to prototype this..

> It's only repair that requires this new behavior that all new buffers
> have to be written through the delwri list, and only because it actually
> /can/ cancel the transaction by finishing the autoreap EFIs.

Right now, yes.  But I think the delwri behavior is a land mine, and
this just happens to be the first user to trigger it.
Edit: while looking through the DELWRI_Q usage I noticed
xfs_qm_flush_one, which seems to deal with what is at least a somewhat
related issue based on the comments there.

> Way back when I first discovered this, my first impulse was to make
> xfs_buf_stale wait for DELWRI_Q to clear.  That IIRC didn't work because
> a transaction could hold an inode that the AIL will need to lock.  I
> think xfs_buf_find_lock would have the same problem.

Yes, that makes sense.  It would be great to document such details in
the commit message.
On Mon, Nov 27, 2023 at 11:13:40PM -0800, Christoph Hellwig wrote:
> > Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does
> > it mean "b_list is active"?  Or does it merely mean "another thread will
> > write this buffer via b_list if nobody gets there first"?  I think it's
> > the second, since it's easy enough to list_empty().
>
> I think it is misnamed and not clearly defined, and we should probably
> fix that.  Note that at least the current list_empty() usage isn't quite
> correct either without a lock held by the delwri list owner.  We at
> least need a list_empty_careful for a racy but still safe check.
>
> Thinking out loud I wonder if we can just kill XBF_DELWRI_Q entirely.
> Except for various asserts we mostly use it in reverse polarity, that is
> to cancel delwri writeout for stale buffers.  How about just skipping
> XBF_STALE buffers on the delwri list directly and not bothering with
> DELWRI_Q?  With that we can then support multiple delwri lists that
> don't get into each other's way using, say, an allocating xarray instead
> of the invasive list, and avoid this whole mess.
>
> Let me try to prototype this..

Ok, I spent half the day prototyping this and it passes basic sanity
checks:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/xfs-multiple-delwri-lists

My conclusions from that are:

 1) it works
 2) I think it is the right thing to do
 3) it needs a lot more work
 4) we can't block the online repair work on it

so I guess we'll need to go with the approach in this patch for now,
maybe with a better commit log, and I'll look into finishing this work
some time in the future.
On Tue, Nov 28, 2023 at 07:18:22AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 27, 2023 at 11:13:40PM -0800, Christoph Hellwig wrote:
> > > Every time I revisit this patch I wonder if DELWRI_Q is misnamed -- does
> > > it mean "b_list is active"?  Or does it merely mean "another thread will
> > > write this buffer via b_list if nobody gets there first"?  I think it's
> > > the second, since it's easy enough to list_empty().
> >
> > I think it is misnamed and not clearly defined, and we should probably
> > fix that.  Note that at least the current list_empty() usage isn't quite
> > correct either without a lock held by the delwri list owner.  We at
> > least need a list_empty_careful for a racy but still safe check.
> >
> > Thinking out loud I wonder if we can just kill XBF_DELWRI_Q entirely.
> > Except for various asserts we mostly use it in reverse polarity, that is
> > to cancel delwri writeout for stale buffers.  How about just skipping
> > XBF_STALE buffers on the delwri list directly and not bothering with
> > DELWRI_Q?  With that we can then support multiple delwri lists that
> > don't get into each other's way using, say, an allocating xarray instead
> > of the invasive list, and avoid this whole mess.
> >
> > Let me try to prototype this..
>
> Ok, I spent half the day prototyping this and it passes basic sanity
> checks:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/xfs-multiple-delwri-lists
>
> My conclusions from that are:
>
>  1) it works
>  2) I think it is the right thing to do
>  3) it needs a lot more work
>  4) we can't block the online repair work on it
>
> so I guess we'll need to go with the approach in this patch for now,
> maybe with a better commit log, and I'll look into finishing this work
> some time in the future.

<nod> I think an xarray version of this would be less clunky than
xfs_delwri_buf with three pointers.
Also note Dave's comments on the v25 posting of this series:
https://lore.kernel.org/linux-xfs/ZJJa7Cnni0mb%2F9sN@dread.disaster.area/

--D
On Tue, Nov 28, 2023 at 09:07:34AM -0800, Darrick J. Wong wrote:
> > so I guess we'll need to go with the approach in this patch for now,
> > maybe with a better commit log, and I'll look into finishing this work
> > some time in the future.
>
> <nod> I think an xarray version of this would be less clunky than
> xfs_delwri_buf with three pointers.

It would, but xarray indices are unsigned longs, so we couldn't simply
use the block number for it.  So unless we decided to give up on
building XFS on 32-bit kernels (which would be nice for other reasons)
it would need something more complicated.
diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index dd75e208b543e..29e3f8ccb1852 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -342,9 +342,7 @@ xfs_btree_bload_drop_buf(
 	if (*bpp == NULL)
 		return;
 
-	if (!xfs_buf_delwri_queue(*bpp, buffers_list))
-		ASSERT(0);
-
+	xfs_buf_delwri_queue_here(*bpp, buffers_list);
 	xfs_buf_relse(*bpp);
 	*bpp = NULL;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 545c7991b9b58..f88b1334b420c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2049,6 +2049,14 @@ xfs_alloc_buftarg(
 	return NULL;
 }
 
+static inline void
+xfs_buf_list_del(
+	struct xfs_buf		*bp)
+{
+	list_del_init(&bp->b_list);
+	wake_up_var(&bp->b_list);
+}
+
 /*
  * Cancel a delayed write list.
  *
@@ -2066,7 +2074,7 @@ xfs_buf_delwri_cancel(
 
 		xfs_buf_lock(bp);
 		bp->b_flags &= ~_XBF_DELWRI_Q;
-		list_del_init(&bp->b_list);
+		xfs_buf_list_del(bp);
 		xfs_buf_relse(bp);
 	}
 }
@@ -2119,6 +2127,37 @@ xfs_buf_delwri_queue(
 	return true;
 }
 
+/*
+ * Queue a buffer to this delwri list as part of a data integrity operation.
+ * If the buffer is on any other delwri list, we'll wait for that to clear
+ * so that the caller can submit the buffer for IO and wait for the result.
+ * Callers must ensure the buffer is not already on the list.
+ */
+void
+xfs_buf_delwri_queue_here(
+	struct xfs_buf		*bp,
+	struct list_head	*buffer_list)
+{
+	/*
+	 * We need this buffer to end up on the /caller's/ delwri list, not any
+	 * old list.  This can happen if the buffer is marked stale (which
+	 * clears DELWRI_Q) after the AIL queues the buffer to its list but
+	 * before the AIL has a chance to submit the list.
+	 */
+	while (!list_empty(&bp->b_list)) {
+		xfs_buf_unlock(bp);
+		wait_var_event(&bp->b_list, list_empty(&bp->b_list));
+		xfs_buf_lock(bp);
+	}
+
+	ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
+
+	/* This buffer is uptodate; don't let it get reread. */
+	bp->b_flags |= XBF_DONE;
+
+	xfs_buf_delwri_queue(bp, buffer_list);
+}
+
 /*
  * Compare function is more complex than it needs to be because
  * the return value is only 32 bits and we are doing comparisons
@@ -2181,7 +2220,7 @@ xfs_buf_delwri_submit_buffers(
 		 * reference and remove it from the list here.
 		 */
 		if (!(bp->b_flags & _XBF_DELWRI_Q)) {
-			list_del_init(&bp->b_list);
+			xfs_buf_list_del(bp);
 			xfs_buf_relse(bp);
 			continue;
 		}
@@ -2201,7 +2240,7 @@ xfs_buf_delwri_submit_buffers(
 			list_move_tail(&bp->b_list, wait_list);
 		} else {
 			bp->b_flags |= XBF_ASYNC;
-			list_del_init(&bp->b_list);
+			xfs_buf_list_del(bp);
 		}
 		__xfs_buf_submit(bp, false);
 	}
@@ -2255,7 +2294,7 @@ xfs_buf_delwri_submit(
 	while (!list_empty(&wait_list)) {
 		bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
 
-		list_del_init(&bp->b_list);
+		xfs_buf_list_del(bp);
 
 		/*
 		 * Wait on the locked buffer, check for errors and unlock and
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index c86e164196568..b470de08a46ca 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -319,6 +319,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
 /* Delayed Write Buffer Routines */
 extern void xfs_buf_delwri_cancel(struct list_head *);
 extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
+void xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *bl);
 extern int xfs_buf_delwri_submit(struct list_head *);
 extern int xfs_buf_delwri_submit_nowait(struct list_head *);
 extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);