[v2,08/13] xfs: elide the AIL lock on log item failure tracking

Message ID 20200422175429.38957-9-bfoster@redhat.com (mailing list archive)
State New, archived
Series xfs: flush related error handling cleanups

Commit Message

Brian Foster April 22, 2020, 5:54 p.m. UTC
The log item failed flag is used to indicate a log item was flushed
but the underlying buffer was not successfully written to disk. If
the error configuration allows for writeback retries, xfsaild uses
the flag to resubmit the underlying buffer without acquiring the
flush lock of the item.

The flag is currently set and cleared under the AIL lock and buffer
lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
completion time. The flag is checked at xfsaild push time under AIL
lock and cleared under buffer lock before resubmission. If I/O
eventually succeeds, the flag is cleared in the _done() handler for
the associated item type, again under AIL lock and buffer lock.

As far as I can tell, the only reason for holding the AIL lock
across sets/clears is to manage consistency between the log item
bitop state and the temporary buffer pointer that is attached to the
log item. The bit itself is used to manage consistency of the
attached buffer, but is not enough to guarantee the buffer is still
attached by the time xfsaild attempts to access it. However, since
failure state is always set or cleared under buffer lock (either via
I/O completion or xfsaild), this particular case can be handled at
item push time by verifying failure state once under buffer lock.

Remove the AIL lock protection from the various bits of log item
failure state management and simplify the surrounding code where
applicable. Use the buffer lock in the xfsaild resubmit code to
detect failure state changes and temporarily treat the item as
locked.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf_item.c   |  4 ----
 fs/xfs/xfs_dquot.c      | 41 +++++++++++++++--------------------------
 fs/xfs/xfs_inode_item.c | 11 +++--------
 fs/xfs/xfs_trans_ail.c  | 12 ++++++++++--
 fs/xfs/xfs_trans_priv.h |  5 -----
 5 files changed, 28 insertions(+), 45 deletions(-)

Comments

Dave Chinner April 23, 2020, 5:59 a.m. UTC | #1
On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> The log item failed flag is used to indicate a log item was flushed
> but the underlying buffer was not successfully written to disk. If
> the error configuration allows for writeback retries, xfsaild uses
> the flag to resubmit the underlying buffer without acquiring the
> flush lock of the item.
> 
> The flag is currently set and cleared under the AIL lock and buffer
> lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
> completion time. The flag is checked at xfsaild push time under AIL
> lock and cleared under buffer lock before resubmission. If I/O
> eventually succeeds, the flag is cleared in the _done() handler for
> the associated item type, again under AIL lock and buffer lock.

Actually, I think you've missed the relevant lock here: the item's
flush lock. The XFS_LI_FAILED flag is item flush state, and flush
state is protected by the flush lock. When the item has been flushed
and attached to the buffer for completion callbacks, the flush lock
context gets handed to the buffer.

i.e. the buffer owns the flush lock and so while that buffer is
locked for IO we know that the item flush state (and hence the
XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer
lock.

(Note: this is how xfs_ifree_cluster() works - it grabs the buffer
lock then walks the items on the buffer and changes the callback
functions because those items are flush locked and hence holding the
buffer lock gives exclusive access to the flush state of those
items....)
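
i.e. in heavily abridged form (cluster buffer already locked, error
handling and the rest of the function elided), the pattern is roughly:

	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
		if (lip->li_type != XFS_LI_INODE)
			continue;
		/*
		 * The inode is flush locked and the flush lock context
		 * is owned by the locked buffer, so the item's flush
		 * state can be modified safely here.
		 */
		iip = (struct xfs_inode_log_item *)lip;
		lip->li_cb = xfs_istale_done;
		xfs_trans_ail_copy_lsn(&iip->ili_flush_lsn,
				       &iip->ili_item.li_lsn);
	}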

> As far as I can tell, the only reason for holding the AIL lock
> across sets/clears is to manage consistency between the log item
> bitop state and the temporary buffer pointer that is attached to the
> log item. The bit itself is used to manage consistency of the
> attached buffer, but is not enough to guarantee the buffer is still
> attached by the time xfsaild attempts to access it.

Correct. The guarantee that the buffer is still attached to the log
item is what the AIL lock provides us with.

> However since
> failure state is always set or cleared under buffer lock (either via
> I/O completion or xfsaild), this particular case can be handled at
> item push time by verifying failure state once under buffer lock.

In theory, yes, but there's a problem before you get that buffer
lock. That is: what serialises access to lip->li_buf?

The xfsaild does not hold a reference to the buffer and, without the
AIL lock to provide synchronisation, the log item reference to the
buffer can be dropped asynchronously by buffer IO completion. Hence
the buffer could be freed between the xfsaild reading lip->li_buf
and calling xfs_buf_trylock(bp). i.e. this introduces a
use-after-free race condition on the initial buffer access.
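
i.e. the window is essentially:

  xfsaild (no locks, no reference)      buffer IO completion
  --------------------------------      --------------------
  bp = lip->li_buf
                                        xfs_clear_li_failed(lip)
                                          lip->li_buf = NULL
                                          xfs_buf_rele(bp)   <- last ref, buffer freed
  xfs_buf_trylock(bp)                   <- use-after-free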

IOWs, the xfsaild cannot access lip->li_buf safely unless the
set/clear operations are protected by the same lock the xfsaild
holds. The xfsaild holds neither the buffer lock, a buffer reference
nor an item flush lock, hence it's the AIL lock or nothing...

Cheers,

Dave.
Brian Foster April 23, 2020, 2:36 p.m. UTC | #2
On Thu, Apr 23, 2020 at 03:59:19PM +1000, Dave Chinner wrote:
> On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> > The log item failed flag is used to indicate a log item was flushed
> > but the underlying buffer was not successfully written to disk. If
> > the error configuration allows for writeback retries, xfsaild uses
> > the flag to resubmit the underlying buffer without acquiring the
> > flush lock of the item.
> > 
> > The flag is currently set and cleared under the AIL lock and buffer
> > lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
> > completion time. The flag is checked at xfsaild push time under AIL
> > lock and cleared under buffer lock before resubmission. If I/O
> > eventually succeeds, the flag is cleared in the _done() handler for
> > the associated item type, again under AIL lock and buffer lock.
> 
> Actually, I think you've missed the relevant lock here: the item's
> flush lock. The XFS_LI_FAILED flag is item flush state, and flush
> state is protected by the flush lock. When the item has been flushed
> and attached to the buffer for completion callbacks, the flush lock
> context gets handed to the buffer.
> 
> i.e. the buffer owns the flush lock and so while that buffer is
> locked for IO we know that the item flush state (and hence the
> XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer
> lock.
> 

Right..

> (Note: this is how xfs_ifree_cluster() works - it grabs the buffer
> lock then walks the items on the buffer and changes the callback
> functions because those items are flush locked and hence holding the
> buffer lock gives exclusive access to the flush state of those
> items....)
> 
> > As far as I can tell, the only reason for holding the AIL lock
> > across sets/clears is to manage consistency between the log item
> > bitop state and the temporary buffer pointer that is attached to the
> > log item. The bit itself is used to manage consistency of the
> > attached buffer, but is not enough to guarantee the buffer is still
> > attached by the time xfsaild attempts to access it.
> 
> Correct. The guarantee that the buffer is still attached to the log
> item is what the AIL lock provides us with.
> 
> > However since
> > failure state is always set or cleared under buffer lock (either via
> > I/O completion or xfsaild), this particular case can be handled at
> > item push time by verifying failure state once under buffer lock.
> 
> In theory, yes, but there's a problem before you get that buffer
> lock. That is: what serialises access to lip->li_buf?
> 

That's partly what I was referring to above by the push time changes.
This patch was an attempt to replace reliance on ail_lock with a push
time sequence that would serialize access to a failed buffer
(essentially relying on the failed bit). Whether it's correct or not is
another matter... ;)

> The xfsaild does not hold a reference to the buffer and, without the
> AIL lock to provide synchronisation, the log item reference to the
> buffer can be dropped asynchronously by buffer IO completion. Hence
> the buffer could be freed between the xfsaild reading lip->li_buf
> and calling xfs_buf_trylock(bp). i.e. this introduces a
> use-after-free race condition on the initial buffer access.
> 

Hmm.. the log item holds a temporary reference to the buffer when the
failed state is set. On submission, xfsaild queues the failed buffer
(new ref for the delwri queue), clears the failed state and drops the
failure reference of every failed item that is attached. xfsaild is also
the only task that knows how to process LI_FAILED items, so I don't
think we'd ever race with a failed buffer becoming "unfailed" from
xfsaild (which is the scenario where a buffer could disappear from I/O
completion). In some sense, clearing LI_FAILED of an item is essentially
retaking ownership of the item flush lock and hold of the underlying
buffer, but the existing code is certainly not written that way.

The issue I was wrestling with wrt this patch was primarily
maintaining consistency between the bit and the assignment of the
pointer on a failing I/O. E.g., the buffer pointer is set by the task
that sets the bit, but xfsaild checking the bit doesn't guarantee the
pointer had been set yet. I did add a post-buffer lock LI_FAILED check
as well, but that probably could have been an assert.
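
E.g., with the helpers as they stand after this patch, the window of
concern is roughly:

	/* I/O completion (->iop_error), under the buffer lock */
	if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags)) {
		xfs_buf_hold(bp);
		lip->li_buf = bp;	/* pointer published after the bit */
	}

	/* xfsaild push, no locks held */
	if (test_bit(XFS_LI_FAILED, &lip->li_flags)) {
		bp = lip->li_buf;	/* may still observe NULL here */
		...
	}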

> IOWs, the xfsaild cannot access lip->li_buf safely unless the
> set/clear operations are protected by the same lock the xfsaild
> holds. The xfsaild holds neither the buffer lock, a buffer reference
> or an item flush lock, hence it's the AIL lock or nothing...
> 

Yeah, that was my impression from reading the code. I realize from this
discussion that this doesn't actually simplify the logic. That was the
primary goal, so I think I need to revisit the approach here. Even if
this is correct (which I'm still not totally sure of), it's fragile in
the sense that it partly relies on single-threadedness of xfsaild. I
initially thought about adding a log item spinlock for this kind of
thing, but it didn't (and still doesn't) seem appropriate to burden
every log item in the system with a spinlock for the rare case of buffer
I/O errors.

I mentioned in an earlier patch that it would be nice to consider
removing ->li_buf entirely but hadn't thought it through. That might
alleviate the need to serialize the pointer and the state in the first
place as well as remove the subtle ordering requirement from the
resubmit code to manage the buffer reference. Suppose we relied on
LI_FAILED to represent the buffer hold + flush lock, and then relied on
the hold to facilitate an in-core lookup from xfsaild. For example:

xfs_set_li_failed(lip, bp)
{
	/* keep the backing buffer in-core so long as the item is failed */
	if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags))
		xfs_buf_hold(bp);
}

xfs_clear_li_failed(lip, bp)
{
	if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags))
		xfs_buf_rele(bp);
}

/*
 * helper to get a mapping from a buffer backed log item and use it to
 * find an in-core buf
 */
xfs_item_to_bp(lip)
{
	if (inode item) {
		...
		blk = ip->imap->im_blkno;
		len = ip->imap->im_len;
	} else if (dquot item) {
		...
		blk = dqp->q_blkno;
		len = mp->m_quotainfo->qi_dqchunklen;
	}

	return xfs_buf_incore(blk, len, XBF_TRYLOCK);
}

xfsaild_resubmit_item(lip)
{
	...

	bp = xfs_item_to_bp(lip);
	if (!bp)
		return XFS_ITEM_LOCKED;
	if (!test_bit(XFS_LI_FAILED, &lip->li_flags))
		goto out_unlock;

	/* drop all failure refs */
	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
		xfs_clear_li_failed(lip, bp);

	xfs_buf_delwri_queue(...);

out_unlock:
	xfs_buf_relse(bp);
	return XFS_ITEM_SUCCESS;
}

I think that would push everything under the buffer lock (and remove
->li_buf) with the caveat that we'd have to lock the inode/dquot to get
the buffer. Any thoughts on something like that? Note that at this point
I might drop this patch from the series regardless due to the complexity
mismatch and consider it separately.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner April 23, 2020, 9:38 p.m. UTC | #3
On Thu, Apr 23, 2020 at 10:36:37AM -0400, Brian Foster wrote:
> On Thu, Apr 23, 2020 at 03:59:19PM +1000, Dave Chinner wrote:
> > On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> > > The log item failed flag is used to indicate a log item was flushed
> > > but the underlying buffer was not successfully written to disk. If
> > > the error configuration allows for writeback retries, xfsaild uses
> > > the flag to resubmit the underlying buffer without acquiring the
> > > flush lock of the item.
> > > 
> > > The flag is currently set and cleared under the AIL lock and buffer
> > > lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
> > > completion time. The flag is checked at xfsaild push time under AIL
> > > lock and cleared under buffer lock before resubmission. If I/O
> > > eventually succeeds, the flag is cleared in the _done() handler for
> > > the associated item type, again under AIL lock and buffer lock.
> > 
> > Actually, I think you've missed the relevant lock here: the item's
> > flush lock. The XFS_LI_FAILED flag is item flush state, and flush
> > state is protected by the flush lock. When the item has been flushed
> > and attached to the buffer for completion callbacks, the flush lock
> > context gets handed to the buffer.
> > 
> > i.e. the buffer owns the flush lock and so while that buffer is
> > locked for IO we know that the item flush state (and hence the
> > XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer
> > lock.
> > 
> 
> Right..
> 
> > (Note: this is how xfs_ifree_cluster() works - it grabs the buffer
> > lock then walks the items on the buffer and changes the callback
> > functions because those items are flush locked and hence holding the
> > buffer lock gives exclusive access to the flush state of those
> > items....)
> > 
> > > As far as I can tell, the only reason for holding the AIL lock
> > > across sets/clears is to manage consistency between the log item
> > > bitop state and the temporary buffer pointer that is attached to the
> > > log item. The bit itself is used to manage consistency of the
> > > attached buffer, but is not enough to guarantee the buffer is still
> > > attached by the time xfsaild attempts to access it.
> > 
> > Correct. The guarantee that the buffer is still attached to the log
> > item is what the AIL lock provides us with.
> > 
> > > However since
> > > failure state is always set or cleared under buffer lock (either via
> > > I/O completion or xfsaild), this particular case can be handled at
> > > item push time by verifying failure state once under buffer lock.
> > 
> > In theory, yes, but there's a problem before you get that buffer
> > lock. That is: what serialises access to lip->li_buf?
> > 
> 
> That's partly what I was referring to above by the push time changes.
> This patch was an attempt to replace reliance on ail_lock with a push
> time sequence that would serialize access to a failed buffer
> (essentially relying on the failed bit). Whether it's correct or not is
> another matter... ;)
> 
> > The xfsaild does not hold a reference to the buffer and, without the
> > AIL lock to provide synchronisation, the log item reference to the
> > buffer can be dropped asynchronously by buffer IO completion. Hence
> > the buffer could be freed between the xfsaild reading lip->li_buf
> > and calling xfs_buf_trylock(bp). i.e. this introduces a
> > use-after-free race condition on the initial buffer access.
> > 
> 
> Hmm.. the log item holds a temporary reference to the buffer when the
> failed state is set. On submission, xfsaild queues the failed buffer
> (new ref for the delwri queue), clears the failed state and drops the
> failure reference of every failed item that is attached. xfsaild is also
> the only task that knows how to process LI_FAILED items, so I don't
> think we'd ever race with a failed buffer becoming "unfailed" from
> xfsaild (which is the scenario where a buffer could disappear from I/O
> completion).

Sure we can. Every inode on the buffer has XFS_LI_FAILED set on it,
so the first inode in the AIL triggers the resubmit and starts the
IO. The AIL runs again, coming across another inode on the buffer
further into the AIL. That inode has been modified in memory while
the flush was in progress, so it no longer gets removed from the AIL
by the IO completion. Instead, xfs_clear_li_failed() is called from
xfs_iflush_done() and the buffer is removed from the logitem and the
reference dropped.
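
i.e. something like:

  xfsaild                               buffer IO completion
  -------                               --------------------
  inode 1: LI_FAILED, buffer
  resubmitted, IO started
  ...
  AIL run reaches inode 2 on the
  same buffer (relogged while the
  flush was in progress)
                                        xfs_iflush_done()
                                          inode 2 stays in the AIL
                                          xfs_clear_li_failed(blip)
                                            blip->li_buf = NULL
                                            xfs_buf_rele(bp)

so it is IO completion, not the xfsaild, that tears down the failed
state and the li_buf reference of inode 2 here.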

> In some sense, clearing LI_FAILED of an item is essentially
> retaking ownership of the item flush lock and hold of the underlying
> buffer, but the existing code is certainly not written that way.

Only IO completion clears the LI_FAILED state of the item, not the
xfsaild. IO completion already owns the flush lock - the xfsaild
only holds it long enough to flush an inode to the buffer and then
give the flush lock to the buffer.

Also, we clear LI_FAILED when removing the item from the AIL under
the AIL lock, so the only case here that we are concerned about is
the above "inode was relogged while being flushed" situation. THis
situation is rare, failed buffers are rare, and so requiring the AIL
lock to be held is a performance limiting factor here...

> The issue I was wrestling with wrt to this patch was primarily
> maintaining consistency between the bit and the assignment of the
> pointer on a failing I/O. E.g., the buffer pointer is set by the task
> that sets the bit, but xfsaild checking the bit doesn't guarantee the
> pointer had been set yet. I did add a post-buffer lock LI_FAILED check
> as well, but that probably could have been an assert.

So you've been focussing on races that may occur with the setting of
the li_buf pointer, but I'm looking at races involving the clearing of
the li_buf pointer. :)

> > IOWs, the xfsaild cannot access lip->li_buf safely unless the
> > set/clear operations are protected by the same lock the xfsaild
> > holds. The xfsaild holds neither the buffer lock, a buffer reference
> > or an item flush lock, hence it's the AIL lock or nothing...
> > 
> 
> Yeah, that was my impression from reading the code. I realize from this
> discussion that this doesn't actually simplify the logic. That was the
> primary goal, so I think I need to revisit the approach here. Even if
> this is correct (which I'm still not totally sure of), it's fragile in
> the sense that it partly relies on single-threadedness of xfsaild. I
> initially thought about adding a log item spinlock for this kind of
> thing, but it didn't (and still doesn't) seem appropriate to burden
> every log item in the system with a spinlock for the rare case of buffer
> I/O errors.

I feel this is all largely moot, because my patches make this whole
problem of setting/clearing the li_buf for failed buffers go away
altogether. Hence I would suggest that this patch gets dropped for
now, because getting rid of the AIL lock is much less troublesome
once LI_FAILED is no longer dependent on the inode/dquot log item
taking temporary references to the underlying buffer....

> 
> I mentioned in an earlier patch that it would be nice to consider
> removing ->li_buf entirely but hadn't thought it through.

I'm going the opposite way entirely, such that LI_FAILED doesn't
have any special state associated with it at all...

Cheers,

Dave.
Brian Foster April 24, 2020, 11:14 a.m. UTC | #4
On Fri, Apr 24, 2020 at 07:38:39AM +1000, Dave Chinner wrote:
> On Thu, Apr 23, 2020 at 10:36:37AM -0400, Brian Foster wrote:
> > On Thu, Apr 23, 2020 at 03:59:19PM +1000, Dave Chinner wrote:
> > > On Wed, Apr 22, 2020 at 01:54:24PM -0400, Brian Foster wrote:
> > > > The log item failed flag is used to indicate a log item was flushed
> > > > but the underlying buffer was not successfully written to disk. If
> > > > the error configuration allows for writeback retries, xfsaild uses
> > > > the flag to resubmit the underlying buffer without acquiring the
> > > > flush lock of the item.
> > > > 
> > > > The flag is currently set and cleared under the AIL lock and buffer
> > > > lock in the ->iop_error() callback, invoked via ->b_iodone() at I/O
> > > > completion time. The flag is checked at xfsaild push time under AIL
> > > > lock and cleared under buffer lock before resubmission. If I/O
> > > > eventually succeeds, the flag is cleared in the _done() handler for
> > > > the associated item type, again under AIL lock and buffer lock.
> > > 
> > > Actually, I think you've missed the relevant lock here: the item's
> > > flush lock. The XFS_LI_FAILED flag is item flush state, and flush
> > > state is protected by the flush lock. When the item has been flushed
> > > and attached to the buffer for completion callbacks, the flush lock
> > > context gets handed to the buffer.
> > > 
> > > i.e. the buffer owns the flush lock and so while that buffer is
> > > locked for IO we know that the item flush state (and hence the
> > > XFS_LI_FAILED flag) is exclusively owned by the holder of the buffer
> > > lock.
> > > 
> > 
> > Right..
> > 
> > > (Note: this is how xfs_ifree_cluster() works - it grabs the buffer
> > > lock then walks the items on the buffer and changes the callback
> > > functions because those items are flush locked and hence holding the
> > > buffer lock gives exclusive access to the flush state of those
> > > items....)
> > > 
> > > > As far as I can tell, the only reason for holding the AIL lock
> > > > across sets/clears is to manage consistency between the log item
> > > > bitop state and the temporary buffer pointer that is attached to the
> > > > log item. The bit itself is used to manage consistency of the
> > > > attached buffer, but is not enough to guarantee the buffer is still
> > > > attached by the time xfsaild attempts to access it.
> > > 
> > > Correct. The guarantee that the buffer is still attached to the log
> > > item is what the AIL lock provides us with.
> > > 
> > > > However since
> > > > failure state is always set or cleared under buffer lock (either via
> > > > I/O completion or xfsaild), this particular case can be handled at
> > > > item push time by verifying failure state once under buffer lock.
> > > 
> > > In theory, yes, but there's a problem before you get that buffer
> > > lock. That is: what serialises access to lip->li_buf?
> > > 
> > 
> > That's partly what I was referring to above by the push time changes.
> > This patch was an attempt to replace reliance on ail_lock with a push
> > time sequence that would serialize access to a failed buffer
> > (essentially relying on the failed bit). Whether it's correct or not is
> > another matter... ;)
> > 
> > > The xfsaild does not hold a reference to the buffer and, without the
> > > AIL lock to provide synchronisation, the log item reference to the
> > > buffer can be dropped asynchronously by buffer IO completion. Hence
> > > the buffer could be freed between the xfsaild reading lip->li_buf
> > > and calling xfs_buf_trylock(bp). i.e. this introduces a
> > > use-after-free race condition on the initial buffer access.
> > > 
> > 
> > Hmm.. the log item holds a temporary reference to the buffer when the
> > failed state is set. On submission, xfsaild queues the failed buffer
> > (new ref for the delwri queue), clears the failed state and drops the
> > failure reference of every failed item that is attached. xfsaild is also
> > the only task that knows how to process LI_FAILED items, so I don't
> > think we'd ever race with a failed buffer becoming "unfailed" from
> > xfsaild (which is the scenario where a buffer could disappear from I/O
> > completion).
> 
> Sure we can. every inode on the buffer has XFS_LI_FAILED set on it,
> so the first inode in the AIL triggers the resubmit and starts the
> IO. the AIL runs again, coming across another inode on the buffer
> further into the AIL. That inode has been modified in memory while
> the flush was in progress, so it no longer gets removed from the AIL
> by the IO completion. Instead, xfs_clear_li_failed() is called from
> xfs_iflush_done() and the buffer is removed from the logitem and the
> reference dropped.
> 

When xfsaild encounters the first failed item, the buffer resubmit path
(xfsaild_resubmit_item(), as of this series) calls xfs_clear_li_failed()
on every inode attached to the buffer. AFAICT, that means xfsaild will
see any associated inode as FLUSHING (or PINNED, if another in-core
modification has come through) once the buffer is requeued.
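
E.g., roughly abridged from the current xfs_inode_item_push() (the
pincount and flush lock checks are the relevant bits here; the flush
lock is still owned by the requeued buffer):

	if (xfs_ipincount(ip) > 0)
		return XFS_ITEM_PINNED;
	...
	if (!xfs_iflock_nowait(ip)) {
		rval = XFS_ITEM_FLUSHING;
		goto out_unlock;
	}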

> > In some sense, clearing LI_FAILED of an item is essentially
> > retaking ownership of the item flush lock and hold of the underlying
> > buffer, but the existing code is certainly not written that way.
> 
> Only IO completion clears the LI_FAILED state of the item, not the
> xfsaild. IO completion already owns the flush lock - the xfsaild
> only holds it long enough to flush an inode to the buffer and then
> give the flush lock to the buffer.
> 
> Also, we clear LI_FAILED when removing the item from the AIL under
> the AIL lock, so the only case here that we are concerned about is
> the above "inode was relogged while being flushed" situation. THis
> situation is rare, failed buffers are rare, and so requiring the AIL
> lock to be held is a performance limiting factor here...
> 

Yep, though because of the above I think I/O completion clearing the
failed state might require a more subtle dance between xfsaild and
another submission context, such as when reclaim happens to flush an
inode that wasn't already attached to a previously failed buffer.

Eh, it doesn't matter that much in any event because..

> > The issue I was wrestling with wrt to this patch was primarily
> > maintaining consistency between the bit and the assignment of the
> > pointer on a failing I/O. E.g., the buffer pointer is set by the task
> > that sets the bit, but xfsaild checking the bit doesn't guarantee the
> > pointer had been set yet. I did add a post-buffer lock LI_FAILED check
> > as well, but that probably could have been an assert.
> 
> SO you've been focussing on races that may occur when the setting of
> the li_buf, but I'm looking at races involving the clearing of the
> li_buf pointer. :)
> 
> > > IOWs, the xfsaild cannot access lip->li_buf safely unless the
> > > set/clear operations are protected by the same lock the xfsaild
> > > holds. The xfsaild holds neither the buffer lock, a buffer reference
> > > or an item flush lock, hence it's the AIL lock or nothing...
> > > 
> > 
> > Yeah, that was my impression from reading the code. I realize from this
> > discussion that this doesn't actually simplify the logic. That was the
> > primary goal, so I think I need to revisit the approach here. Even if
> > this is correct (which I'm still not totally sure of), it's fragile in
> > the sense that it partly relies on single-threadedness of xfsaild. I
> > initially thought about adding a log item spinlock for this kind of
> > thing, but it didn't (and still doesn't) seem appropriate to burden
> > every log item in the system with a spinlock for the rare case of buffer
> > I/O errors.
> 
> I feel this is all largely a moot, because my patches result in this
> this whole problem of setting/clearing the li_buf for failed buffers
> go away altogether. Hence I would suggest that this patch gets
> dropped for now, because getting rid of the AIL lock is much less
> troublesome once LI_FAILED is no long dependent on the inode/dquot
> log item taking temporary references to the underlying buffer....
> 

... I'm good with that. ;) This was intended to be a low level cleanup
to eliminate a fairly minor ail_lock abuse. Working around the ->li_buf
thing is the primary challenge, so if you have a broader rework that
addresses that as a side effect then that presumably makes things
easier.  I'll revisit this if necessary once your work is further
along...

Brian

> > 
> > I mentioned in an earlier patch that it would be nice to consider
> > removing ->li_buf entirely but hadn't thought it through.
> 
> I'm going the opposite way entirely, such that LI_FAILED doesn't
> have any special state associated with it at all...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Patch

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 23cbfeb82183..59906a6e49ae 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1038,7 +1038,6 @@  xfs_buf_do_callbacks_fail(
 	struct xfs_buf		*bp)
 {
 	struct xfs_log_item	*lip;
-	struct xfs_ail		*ailp;
 
 	/*
 	 * Buffer log item errors are handled directly by xfs_buf_item_push()
@@ -1050,13 +1049,10 @@  xfs_buf_do_callbacks_fail(
 
 	lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
 			li_bio_list);
-	ailp = lip->li_ailp;
-	spin_lock(&ailp->ail_lock);
 	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
 		if (lip->li_ops->iop_error)
 			lip->li_ops->iop_error(lip, bp);
 	}
-	spin_unlock(&ailp->ail_lock);
 }
 
 static bool
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index ffe607733c50..baeb111ae9dc 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1023,34 +1023,23 @@  xfs_qm_dqflush_done(
 	struct xfs_ail		*ailp = lip->li_ailp;
 
 	/*
-	 * We only want to pull the item from the AIL if its
-	 * location in the log has not changed since we started the flush.
-	 * Thus, we only bother if the dquot's lsn has
-	 * not changed. First we check the lsn outside the lock
-	 * since it's cheaper, and then we recheck while
-	 * holding the lock before removing the dquot from the AIL.
+	 * Only pull the item from the AIL if its location in the log has not
+	 * changed since it was flushed. Do a lockless check first to reduce
+	 * lock traffic.
 	 */
-	if (test_bit(XFS_LI_IN_AIL, &lip->li_flags) &&
-	    ((lip->li_lsn == qip->qli_flush_lsn) ||
-	     test_bit(XFS_LI_FAILED, &lip->li_flags))) {
-
-		/* xfs_trans_ail_delete() drops the AIL lock. */
-		spin_lock(&ailp->ail_lock);
-		if (lip->li_lsn == qip->qli_flush_lsn) {
-			xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
-		} else {
-			/*
-			 * Clear the failed state since we are about to drop the
-			 * flush lock
-			 */
-			xfs_clear_li_failed(lip);
-			spin_unlock(&ailp->ail_lock);
-		}
-	}
+	if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags) ||
+	    lip->li_lsn != qip->qli_flush_lsn)
+		goto out;
 
-	/*
-	 * Release the dq's flush lock since we're done with it.
-	 */
+	spin_lock(&ailp->ail_lock);
+	if (lip->li_lsn == qip->qli_flush_lsn)
+		/* xfs_trans_ail_delete() drops the AIL lock */
+		xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
+	else
+		spin_unlock(&ailp->ail_lock);
+
+out:
+	xfs_clear_li_failed(lip);
 	xfs_dqfunlock(dqp);
 }
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 1d4d256a2e96..0ae61844b224 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -682,9 +682,7 @@  xfs_iflush_done(
 	 * Scan the buffer IO completions for other inodes being completed and
 	 * attach them to the current inode log item.
 	 */
-
 	list_add_tail(&lip->li_bio_list, &tmp);
-
 	list_for_each_entry_safe(blip, n, &bp->b_li_list, li_bio_list) {
 		if (lip->li_cb != xfs_iflush_done)
 			continue;
@@ -695,15 +693,13 @@  xfs_iflush_done(
 		 * the AIL lock.
 		 */
 		iip = INODE_ITEM(blip);
-		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
-		    test_bit(XFS_LI_FAILED, &blip->li_flags))
+		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
 			need_ail++;
 	}
 
 	/* make sure we capture the state of the initial inode. */
 	iip = INODE_ITEM(lip);
-	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
-	    test_bit(XFS_LI_FAILED, &lip->li_flags))
+	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
 		need_ail++;
 
 	/*
@@ -732,8 +728,6 @@  xfs_iflush_done(
 				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
 				if (!tail_lsn && lsn)
 					tail_lsn = lsn;
-			} else {
-				xfs_clear_li_failed(blip);
 			}
 		}
 		xfs_ail_update_finish(ailp, tail_lsn);
@@ -746,6 +740,7 @@  xfs_iflush_done(
 	 */
 	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
 		list_del_init(&blip->li_bio_list);
+		xfs_clear_li_failed(blip);
 		iip = INODE_ITEM(blip);
 		iip->ili_logged = 0;
 		iip->ili_last_fields = 0;
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 2574d01e4a83..e03643efdac1 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -368,15 +368,23 @@  xfsaild_resubmit_item(
 {
 	struct xfs_buf		*bp = lip->li_buf;
 
-	if (!xfs_buf_trylock(bp))
+	/*
+	 * Log item state bits are racy so we cannot assume the temporary buffer
+	 * pointer is set. Treat the item as locked if the pointer is clear or
+	 * the failure state has changed once we've locked out I/O completion.
+	 */
+	if (!bp || !xfs_buf_trylock(bp))
 		return XFS_ITEM_LOCKED;
+	if (!test_bit(XFS_LI_FAILED, &lip->li_flags)) {
+		xfs_buf_unlock(bp);
+		return XFS_ITEM_LOCKED;
+	}
 
 	if (!xfs_buf_delwri_queue(bp, buffer_list)) {
 		xfs_buf_unlock(bp);
 		return XFS_ITEM_FLUSHING;
 	}
 
-	/* protected by ail_lock */
 	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
 		xfs_clear_li_failed(lip);
 
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 35655eac01a6..9135afdcee9d 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -158,9 +158,6 @@  xfs_clear_li_failed(
 {
 	struct xfs_buf	*bp = lip->li_buf;
 
-	ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags));
-	lockdep_assert_held(&lip->li_ailp->ail_lock);
-
 	if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags)) {
 		lip->li_buf = NULL;
 		xfs_buf_rele(bp);
@@ -172,8 +169,6 @@  xfs_set_li_failed(
 	struct xfs_log_item	*lip,
 	struct xfs_buf		*bp)
 {
-	lockdep_assert_held(&lip->li_ailp->ail_lock);
-
 	if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags)) {
 		xfs_buf_hold(bp);
 		lip->li_buf = bp;