diff mbox

[2/2,V3] xfs: Properly retry failed inode items in case of error during buffer writeback

Message ID 20170608140014.25132-3-cmaiolino@redhat.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Carlos Maiolino June 8, 2017, 2 p.m. UTC
When a buffer fails during writeback, the inode items in it are kept
flush locked and are never resubmitted because of the flush lock, so,
if any buffer fails to be written, the items in the AIL are never
written to disk and never unlocked.

This causes the unmount operation to hang because of these flush-locked
items in the AIL, but it also means the items in the AIL are never
written back, even when the I/O device comes back to normal.

I've been testing this patch with a DM-thin device, creating a
filesystem larger than the real device.

When writing enough data to fill the DM-thin device, XFS receives ENOSPC
errors from the device and keeps spinning in xfsaild (when the 'retry
forever' error configuration is set).

At this point, the filesystem cannot be unmounted because of the flush-locked
items in the AIL; worse, the items in the AIL are never retried at all
(since xfs_inode_item_push() skips items that are flush locked),
even if the underlying DM-thin device is expanded to the proper size.

This patch fixes both cases, retrying any item that previously failed,
using the infrastructure provided by the previous patch.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
V2:
	- Fix XFS_LI_FAILED flag removal
	- Use atomic operations to set and clear XFS_LI_FAILED flag
	- Remove check for XBF_WRITE_FAIL in xfs_inode_item_push
	- Add more comments to the code
	- Add a helper function to resubmit the failed buffers, so this
	  can be also used in dquot system without duplicating code

V3:
	- drop the xfs_imap_to_bp call, using a pointer in the log item
	  to hold the buffer address instead, fixing a possible infinite
	  loop if xfs_imap_to_bp returns -EIO
	- use xa_lock instead of atomic operations to handle log item
	  flags
	- Add a hold to the buffer for each log item failed
	- move buffer resubmission up in xfs_inode_item_push()

 fs/xfs/xfs_buf_item.c   | 39 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_buf_item.h   |  2 ++
 fs/xfs/xfs_inode_item.c | 44 +++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_trans.h      |  1 +
 fs/xfs/xfs_trans_ail.c  |  4 ++--
 5 files changed, 85 insertions(+), 5 deletions(-)

Comments

Brian Foster June 8, 2017, 4:45 p.m. UTC | #1
On Thu, Jun 08, 2017 at 04:00:14PM +0200, Carlos Maiolino wrote:
> When a buffer has been failed during writeback, the inode items into it
> are kept flush locked, and are never resubmitted due the flush lock, so,
> if any buffer fails to be written, the items in AIL are never written to
> disk and never unlocked.
> 
> This causes unmount operation to hang due these items flush locked in AIL,
> but this also causes the items in AIL to never be written back, even when
> the IO device comes back to normal.
> 
> I've been testing this patch with a DM-thin device, creating a
> filesystem larger than the real device.
> 
> When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> errors from the device, and keep spinning on xfsaild (when 'retry
> forever' configuration is set).
> 
> At this point, the filesystem can not be unmounted because of the flush locked
> items in AIL, but worse, the items in AIL are never retried at all
> (once xfs_inode_item_push() will skip the items that are flush locked),
> even if the underlying DM-thin device is expanded to the proper size.
> 
> This patch fixes both cases, retrying any item that has been failed
> previously, using the infra-structure provided by the previous patch.
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
> V2:
> 	- Fix XFS_LI_FAILED flag removal
> 	- Use atomic operations to set and clear XFS_LI_FAILED flag
> 	- Remove check for XBF_WRITE_FAIL in xfs_inode_item_push
> 	- Add more comments to the code
> 	- Add a helper function to resubmit the failed buffers, so this
> 	  can be also used in dquot system without duplicating code
> 
> V3:
> 	- drop xfs_imap_to_bp call using a pointer in the log item to
> 	  hold the buffer address, fixing a possible infinite loop if
> 	  xfs_map_to_bp returns -EIO
> 	- use xa_lock instead of atomic operations to handle log item
> 	  flags
> 	- Add a hold to the buffer for each log item failed
> 	- move buffer resubmission up in xfs_inode_item_push()
> 
>  fs/xfs/xfs_buf_item.c   | 39 +++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_buf_item.h   |  2 ++
>  fs/xfs/xfs_inode_item.c | 44 +++++++++++++++++++++++++++++++++++++++++---
>  fs/xfs/xfs_trans.h      |  1 +
>  fs/xfs/xfs_trans_ail.c  |  4 ++--
>  5 files changed, 85 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 523b7a4..9b7bd23 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1222,3 +1222,42 @@ xfs_buf_iodone(
>  	xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
>  	xfs_buf_item_free(BUF_ITEM(lip));
>  }
> +
> +/*
> + * Requeue a failed buffer for writeback
> + *
> + * Return true if the buffer has been re-queued properly, false otherwise
> + */
> +bool
> +xfs_buf_resubmit_failed_buffers(
> +	struct xfs_inode	*ip,
> +	struct xfs_log_item	*lip,
> +	struct list_head	*buffer_list)
> +{
> +	struct xfs_log_item	*next;
> +	struct xfs_buf		*bp;
> +	bool			ret;
> +
> +	bp = lip->li_buf;
> +
> +	/* Clear XFS_LI_FAILED flag from all items before resubmit
> +	 *
> +	 * XFS_LI_FAILED set/clear is protected by xa_lock, caller  this
> +	 * function already have it acquired
> +	 */
> +	for (; lip; lip = next) {
> +		next = lip->li_bio_list;
> +		lip->li_flags &= ~XFS_LI_FAILED;
> +		xfs_buf_rele(bp);

We need to check that the log item has LI_FAILED set before we release
the buffer because it's possible for the buffer to have a mix of failed
and recently flushed (not failed) items. We should set ->li_buf = NULL
wherever we clear LI_FAILED as well.

A couple small inline helpers to fail/unfail log items correctly (i.e.,
xfs_log_item_[fail|clear]() or some such) might help to make sure we get
the reference counting right. We could also assert that the correct lock
is held from those helpers.
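As a rough illustration of the suggestion, such helpers could look something like the userspace model below; the types, field names, and hold-count handling are purely hypothetical stand-ins for the kernel structures, not the real XFS API:

```c
#include <assert.h>
#include <stddef.h>

#define XFS_LI_FAILED	0x1	/* illustrative flag value */

struct model_buf {
	int			b_hold;	/* stand-in for the buffer hold count */
};

struct model_log_item {
	unsigned int		li_flags;
	struct model_buf	*li_buf;
};

/*
 * Mark a log item failed: take a buffer hold only on the first
 * failure, so repeated failures of the same item do not leak holds.
 */
static void
model_log_item_fail(struct model_log_item *lip, struct model_buf *bp)
{
	if (lip->li_flags & XFS_LI_FAILED)
		return;			/* already failed, already holds bp */
	bp->b_hold++;
	lip->li_flags |= XFS_LI_FAILED;
	lip->li_buf = bp;
}

/*
 * Clear the failed state: drop the hold and the buffer pointer only
 * if the item was actually failed.
 */
static void
model_log_item_clear(struct model_log_item *lip)
{
	if (!(lip->li_flags & XFS_LI_FAILED))
		return;			/* not failed, nothing to release */
	lip->li_flags &= ~XFS_LI_FAILED;
	lip->li_buf->b_hold--;
	lip->li_buf = NULL;
}
```

The point of funneling every set/clear site through two helpers like these is that the hold/release pairing lives in exactly one place, which is what keeps the mixed failed/not-failed item case from dropping a hold it never took.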

> +	}
> +
> +	/* Add this buffer back to the delayed write list */
> +	xfs_buf_lock(bp);
> +	if (!xfs_buf_delwri_queue(bp, buffer_list))
> +		ret = false;
> +	else
> +		ret = true;
> +
> +	xfs_buf_unlock(bp);
> +	return ret;
> +}
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index f7eba99..32aceae 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -70,6 +70,8 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
>  			      xfs_log_item_t *);
>  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
>  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
> +bool	xfs_buf_resubmit_failed_buffers(struct xfs_inode *, struct xfs_log_item *,
> +					struct list_head *);
>  
>  extern kmem_zone_t	*xfs_buf_item_zone;
>  
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 08cb7d1..2f5a49e 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -27,6 +27,7 @@
>  #include "xfs_error.h"
>  #include "xfs_trace.h"
>  #include "xfs_trans_priv.h"
> +#include "xfs_buf_item.h"
>  #include "xfs_log.h"
>  
>  
> @@ -475,6 +476,26 @@ xfs_inode_item_unpin(
>  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
>  }
>  
> +STATIC void
> +xfs_inode_item_error(
> +	struct xfs_log_item	*lip,
> +	struct xfs_buf		*bp,
> +	unsigned int		bflags)
> +{
> +
> +	/*
> +	 * The buffer writeback containing this inode has been failed
> +	 * mark it as failed and unlock the flush lock, so it can be retried
> +	 * again.
> +	 * Use xa_lock to avoid races with other log item state changes.
> +	 */
> +	spin_lock(&lip->li_ailp->xa_lock);

I'd push the ->xa_lock up into caller and cycle it once over the list
rather than once per item. This is more consistent with the locking
pattern for successful I/Os.

> +	xfs_buf_hold(bp);
> +	lip->li_flags |= XFS_LI_FAILED;
> +	lip->li_buf = bp;

Also, while I think this is technically Ok based on the current
implementation, I think we should also test that LI_FAILED is not
already set and only hold the buf in that case (i.e., the previously
mentioned helper). This helps prevent a landmine if the
implementation/locking/etc. changes down the road.

I think it would be fine to assert that LI_FAILED is not set already if
we wanted to, though, as long as there's a brief comment as well.

> +	spin_unlock(&lip->li_ailp->xa_lock);
> +}
> +
>  STATIC uint
>  xfs_inode_item_push(
>  	struct xfs_log_item	*lip,
> @@ -504,6 +525,18 @@ xfs_inode_item_push(
>  	}
>  
>  	/*
> +	 * The buffer containing this item failed to be written back
> +	 * previously. Resubmit the buffer for IO.
> +	 */
> +	if (lip->li_flags & XFS_LI_FAILED) {
> +		if (!xfs_buf_resubmit_failed_buffers(ip, lip,
> +						     buffer_list))
> +			rval = XFS_ITEM_FLUSHING;
> +
> +		goto out_unlock;
> +	}

I think this needs to go at least before the ilock. We don't need the
latter lock because the inode is already flushed. Somebody else could be
holding ilock and blocked on the flush lock. If that occurs, we'll never
make progress.

I also think we should lock/unlock the buffer here rather than down in
the helper. Hmm, maybe even queue it as well so it's clear what's going
on. For example, something like:

	if (lip->li_flags & XFS_LI_FAILED) {
		xfs_buf_lock(bp);
		xfs_buf_clear_failed_items(bp);
		if (!xfs_buf_delwri_queue(bp, buffer_list))
			rval = XFS_ITEM_FLUSHING;
		xfs_buf_unlock(bp);
		goto out_unlock;
	}

... but that is mostly just aesthetic.

> +
> +	/*
>  	 * Stale inode items should force out the iclog.
>  	 */
>  	if (ip->i_flags & XFS_ISTALE) {
> @@ -622,7 +655,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
>  	.iop_unlock	= xfs_inode_item_unlock,
>  	.iop_committed	= xfs_inode_item_committed,
>  	.iop_push	= xfs_inode_item_push,
> -	.iop_committing = xfs_inode_item_committing
> +	.iop_committing = xfs_inode_item_committing,
> +	.iop_error	= xfs_inode_item_error
>  };
>  
>  
> @@ -710,7 +744,8 @@ xfs_iflush_done(
>  		 * the AIL lock.
>  		 */
>  		iip = INODE_ITEM(blip);
> -		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
> +		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
> +		    lip->li_flags & XFS_LI_FAILED)
>  			need_ail++;
>  
>  		blip = next;
> @@ -718,7 +753,8 @@ xfs_iflush_done(
>  
>  	/* make sure we capture the state of the initial inode. */
>  	iip = INODE_ITEM(lip);
> -	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
> +	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
> +	    lip->li_flags & XFS_LI_FAILED)
>  		need_ail++;
>  
>  	/*
> @@ -739,6 +775,8 @@ xfs_iflush_done(
>  			if (INODE_ITEM(blip)->ili_logged &&
>  			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
>  				mlip_changed |= xfs_ail_delete_one(ailp, blip);
> +			else if (blip->li_flags & XFS_LI_FAILED)
> +				blip->li_flags &= ~XFS_LI_FAILED;

This needs to do the full "unfail" sequence as well (another place to
call the helper).

>  		}
>  
>  		if (mlip_changed) {
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 3486517..86c3464 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -49,6 +49,7 @@ typedef struct xfs_log_item {
>  	struct xfs_ail			*li_ailp;	/* ptr to AIL */
>  	uint				li_type;	/* item type */
>  	uint				li_flags;	/* misc flags */
> +	struct xfs_buf			*li_buf;	/* real buffer pointer */
>  	struct xfs_log_item		*li_bio_list;	/* buffer item list */
>  	void				(*li_cb)(struct xfs_buf *,
>  						 struct xfs_log_item *);
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 9056c0f..36e08dd 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -687,13 +687,13 @@ xfs_trans_ail_update_bulk(
>  bool
>  xfs_ail_delete_one(
>  	struct xfs_ail		*ailp,
> -	struct xfs_log_item 	*lip)
> +	struct xfs_log_item	*lip)
>  {
>  	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
>  
>  	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
>  	xfs_ail_delete(ailp, lip);
> -	lip->li_flags &= ~XFS_LI_IN_AIL;
> +	lip->li_flags &= ~(XFS_LI_IN_AIL | XFS_LI_FAILED);

... and the same thing here.

Brian

>  	lip->li_lsn = 0;
>  
>  	return mlip == lip;
> -- 
> 2.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Carlos Maiolino June 12, 2017, 12:44 p.m. UTC | #2
Hey.

> > +/*
> > + * Requeue a failed buffer for writeback
> > + *
> > + * Return true if the buffer has been re-queued properly, false otherwise
> > + */
> > +bool
> > +xfs_buf_resubmit_failed_buffers(
> > +	struct xfs_inode	*ip,
> > +	struct xfs_log_item	*lip,
> > +	struct list_head	*buffer_list)
> > +{
> > +	struct xfs_log_item	*next;
> > +	struct xfs_buf		*bp;
> > +	bool			ret;
> > +
> > +	bp = lip->li_buf;
> > +
> > +	/* Clear XFS_LI_FAILED flag from all items before resubmit
> > +	 *
> > +	 * XFS_LI_FAILED set/clear is protected by xa_lock, caller  this
> > +	 * function already have it acquired
> > +	 */
> > +	for (; lip; lip = next) {
> > +		next = lip->li_bio_list;
> > +		lip->li_flags &= ~XFS_LI_FAILED;
> > +		xfs_buf_rele(bp);
> 
> We need to check that the log item has LI_FAILED set before we release
> the buffer because it's possible for the buffer to have a mix of failed
> and recently flushed (not failed) items. We should set ->li_buf = NULL
> wherever we clear LI_FAILED as well.
> 
> A couple small inline helpers to fail/unfail log items correctly (i.e.,
> xfs_log_item_[fail|clear]() or some such) might help to make sure we get
> the reference counting right. We could also assert that the correct lock
> is held from those helpers.
> 

Agreed, will queue it for V4

> > +	}
> > +
> > +	/* Add this buffer back to the delayed write list */
> > +	xfs_buf_lock(bp);
> > +	if (!xfs_buf_delwri_queue(bp, buffer_list))
> > +		ret = false;
> > +	else
> > +		ret = true;
> > +
> > +	xfs_buf_unlock(bp);
> > +	return ret;
> > +}
> > diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> > index f7eba99..32aceae 100644
> > --- a/fs/xfs/xfs_buf_item.h
> > +++ b/fs/xfs/xfs_buf_item.h
> > @@ -70,6 +70,8 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
> >  			      xfs_log_item_t *);
> >  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
> >  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
> > +bool	xfs_buf_resubmit_failed_buffers(struct xfs_inode *, struct xfs_log_item *,
> > +					struct list_head *);
> >  
> >  extern kmem_zone_t	*xfs_buf_item_zone;
> >  
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index 08cb7d1..2f5a49e 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -27,6 +27,7 @@
> >  #include "xfs_error.h"
> >  #include "xfs_trace.h"
> >  #include "xfs_trans_priv.h"
> > +#include "xfs_buf_item.h"
> >  #include "xfs_log.h"
> >  
> >  
> > @@ -475,6 +476,26 @@ xfs_inode_item_unpin(
> >  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> >  }
> >  
> > +STATIC void
> > +xfs_inode_item_error(
> > +	struct xfs_log_item	*lip,
> > +	struct xfs_buf		*bp,
> > +	unsigned int		bflags)
> > +{
> > +
> > +	/*
> > +	 * The buffer writeback containing this inode has been failed
> > +	 * mark it as failed and unlock the flush lock, so it can be retried
> > +	 * again.
> > +	 * Use xa_lock to avoid races with other log item state changes.
> > +	 */
> > +	spin_lock(&lip->li_ailp->xa_lock);
> 
> I'd push the ->xa_lock up into caller and cycle it once over the list
> rather than once per item. This is more consistent with the locking
> pattern for successful I/Os.
> 

I considered it too, and decided to use spin_lock/unlock on each iteration to
avoid holding the lock for a long time, but I don't think there is any harm in
using a single lock/unlock.

> > +	xfs_buf_hold(bp);
> > +	lip->li_flags |= XFS_LI_FAILED;
> > +	lip->li_buf = bp;
> 
> Also, while I think this is technically Ok based on the current
> implementation, I think we should also test that LI_FAILED is not
> already set and only hold the buf in that case (i.e., the previously
> mentioned helper). This helps prevent a landmine if the
> implementation/locking/etc. changes down the road.
> 
> I think it would be fine to assert that LI_FAILED is not set already if
> we wanted to, though, as long as there's a brief comment as well.
> 
> > +	spin_unlock(&lip->li_ailp->xa_lock);
> > +}
> > +
> >  STATIC uint
> >  xfs_inode_item_push(
> >  	struct xfs_log_item	*lip,
> > @@ -504,6 +525,18 @@ xfs_inode_item_push(
> >  	}
> >  
> >  	/*
> > +	 * The buffer containing this item failed to be written back
> > +	 * previously. Resubmit the buffer for IO.
> > +	 */
> > +	if (lip->li_flags & XFS_LI_FAILED) {
> > +		if (!xfs_buf_resubmit_failed_buffers(ip, lip,
> > +						     buffer_list))
> > +			rval = XFS_ITEM_FLUSHING;
> > +
> > +		goto out_unlock;
> > +	}
> 
> I think this needs to go at least before the ilock. We don't need the
> latter lock because the inode is already flushed. Somebody else could be
> holding ilock and blocked on the flush lock. If that occurs, we'll never
> make progress.

It should be after the ilock IIRC; having it before the ilock will deadlock
any unmount when there are still inodes to be destroyed.
> 
> I also think we should lock/unlock the buffer here rather than down in
> the helper. Hmm, maybe even queue it as well so it's clear what's going
> on. For example, something like:
> 
> 	if (lip->li_flags & XFS_LI_FAILED) {
> 		xfs_buf_lock(bp);
> 		xfs_buf_clear_failed_items(bp);
> 		if (!xfs_buf_delwri_queue(bp, buffer_list))
> 			rval = XFS_ITEM_FLUSHING;
> 		xfs_buf_unlock(bp);
> 		goto out_unlock;
> 	}
> 

Well, we could do that, but we would need to duplicate this code between the
inode and dquot items, while by keeping the helper as-is we can re-use it in
both the inode and dquot code.

> ... but that is mostly just aesthetic.
> 
> > +
> > +	/*
> >  	 * Stale inode items should force out the iclog.
> >  	 */
> >  	if (ip->i_flags & XFS_ISTALE) {
> > @@ -622,7 +655,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
> >  	.iop_unlock	= xfs_inode_item_unlock,
> >  	.iop_committed	= xfs_inode_item_committed,
> >  	.iop_push	= xfs_inode_item_push,
> > -	.iop_committing = xfs_inode_item_committing
> > +	.iop_committing = xfs_inode_item_committing,
> > +	.iop_error	= xfs_inode_item_error
> >  };
> >  
> >  
> > @@ -710,7 +744,8 @@ xfs_iflush_done(
> >  		 * the AIL lock.
> >  		 */
> >  		iip = INODE_ITEM(blip);
> > -		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
> > +		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
> > +		    lip->li_flags & XFS_LI_FAILED)
> >  			need_ail++;
> >  
> >  		blip = next;
> > @@ -718,7 +753,8 @@ xfs_iflush_done(
> >  
> >  	/* make sure we capture the state of the initial inode. */
> >  	iip = INODE_ITEM(lip);
> > -	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
> > +	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
> > +	    lip->li_flags & XFS_LI_FAILED)
> >  		need_ail++;
> >  
> >  	/*
> > @@ -739,6 +775,8 @@ xfs_iflush_done(
> >  			if (INODE_ITEM(blip)->ili_logged &&
> >  			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
> >  				mlip_changed |= xfs_ail_delete_one(ailp, blip);
> > +			else if (blip->li_flags & XFS_LI_FAILED)
> > +				blip->li_flags &= ~XFS_LI_FAILED;
> 
> This needs to do the full "unfail" sequence as well (another place to
> call the helper).
> 
> >  		}
> >  
> >  		if (mlip_changed) {
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index 3486517..86c3464 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -49,6 +49,7 @@ typedef struct xfs_log_item {
> >  	struct xfs_ail			*li_ailp;	/* ptr to AIL */
> >  	uint				li_type;	/* item type */
> >  	uint				li_flags;	/* misc flags */
> > +	struct xfs_buf			*li_buf;	/* real buffer pointer */
> >  	struct xfs_log_item		*li_bio_list;	/* buffer item list */
> >  	void				(*li_cb)(struct xfs_buf *,
> >  						 struct xfs_log_item *);
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 9056c0f..36e08dd 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -687,13 +687,13 @@ xfs_trans_ail_update_bulk(
> >  bool
> >  xfs_ail_delete_one(
> >  	struct xfs_ail		*ailp,
> > -	struct xfs_log_item 	*lip)
> > +	struct xfs_log_item	*lip)
> >  {
> >  	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
> >  
> >  	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
> >  	xfs_ail_delete(ailp, lip);
> > -	lip->li_flags &= ~XFS_LI_IN_AIL;
> > +	lip->li_flags &= ~(XFS_LI_IN_AIL | XFS_LI_FAILED);
> 
> ... and the same thing here.
> 

I agree with the helpers and I'll queue them up for V4; I'll just wait for some
more reviews on this version before sending V4.

Thanks for the review Brian.

Cheers
Brian Foster June 12, 2017, 1:45 p.m. UTC | #3
On Mon, Jun 12, 2017 at 02:44:23PM +0200, Carlos Maiolino wrote:
> Hey.
> 
> > > +/*
> > > + * Requeue a failed buffer for writeback
> > > + *
> > > + * Return true if the buffer has been re-queued properly, false otherwise
> > > + */
> > > +bool
> > > +xfs_buf_resubmit_failed_buffers(
> > > +	struct xfs_inode	*ip,
> > > +	struct xfs_log_item	*lip,
> > > +	struct list_head	*buffer_list)
> > > +{
> > > +	struct xfs_log_item	*next;
> > > +	struct xfs_buf		*bp;
> > > +	bool			ret;
> > > +
> > > +	bp = lip->li_buf;
> > > +
> > > +	/* Clear XFS_LI_FAILED flag from all items before resubmit
> > > +	 *
> > > +	 * XFS_LI_FAILED set/clear is protected by xa_lock, caller  this
> > > +	 * function already have it acquired
> > > +	 */
> > > +	for (; lip; lip = next) {
> > > +		next = lip->li_bio_list;
> > > +		lip->li_flags &= ~XFS_LI_FAILED;
> > > +		xfs_buf_rele(bp);
> > 
> > We need to check that the log item has LI_FAILED set before we release
> > the buffer because it's possible for the buffer to have a mix of failed
> > and recently flushed (not failed) items. We should set ->li_buf = NULL
> > wherever we clear LI_FAILED as well.
> > 
> > A couple small inline helpers to fail/unfail log items correctly (i.e.,
> > xfs_log_item_[fail|clear]() or some such) might help to make sure we get
> > the reference counting right. We could also assert that the correct lock
> > is held from those helpers.
> > 
> 
> Agreed, will queue it for V4
> 
> > > +	}
> > > +
> > > +	/* Add this buffer back to the delayed write list */
> > > +	xfs_buf_lock(bp);
> > > +	if (!xfs_buf_delwri_queue(bp, buffer_list))
> > > +		ret = false;
> > > +	else
> > > +		ret = true;
> > > +
> > > +	xfs_buf_unlock(bp);
> > > +	return ret;
> > > +}
> > > diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> > > index f7eba99..32aceae 100644
> > > --- a/fs/xfs/xfs_buf_item.h
> > > +++ b/fs/xfs/xfs_buf_item.h
> > > @@ -70,6 +70,8 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
> > >  			      xfs_log_item_t *);
> > >  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
> > >  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
> > > +bool	xfs_buf_resubmit_failed_buffers(struct xfs_inode *, struct xfs_log_item *,
> > > +					struct list_head *);
> > >  
> > >  extern kmem_zone_t	*xfs_buf_item_zone;
> > >  
> > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > > index 08cb7d1..2f5a49e 100644
> > > --- a/fs/xfs/xfs_inode_item.c
> > > +++ b/fs/xfs/xfs_inode_item.c
> > > @@ -27,6 +27,7 @@
> > >  #include "xfs_error.h"
> > >  #include "xfs_trace.h"
> > >  #include "xfs_trans_priv.h"
> > > +#include "xfs_buf_item.h"
> > >  #include "xfs_log.h"
> > >  
> > >  
> > > @@ -475,6 +476,26 @@ xfs_inode_item_unpin(
> > >  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> > >  }
> > >  
> > > +STATIC void
> > > +xfs_inode_item_error(
> > > +	struct xfs_log_item	*lip,
> > > +	struct xfs_buf		*bp,
> > > +	unsigned int		bflags)
> > > +{
> > > +
> > > +	/*
> > > +	 * The buffer writeback containing this inode has been failed
> > > +	 * mark it as failed and unlock the flush lock, so it can be retried
> > > +	 * again.
> > > +	 * Use xa_lock to avoid races with other log item state changes.
> > > +	 */
> > > +	spin_lock(&lip->li_ailp->xa_lock);
> > 
> > I'd push the ->xa_lock up into caller and cycle it once over the list
> > rather than once per item. This is more consistent with the locking
> > pattern for successful I/Os.
> > 
> 
> I considered it too, and decided to use spin_lock/unlock on each interaction to
> avoid having it locked for a long time, but I don't think there is any harm in
> using a single lock/unlock.
> 

I don't think we're too concerned about performance at this low a level
when I/Os are failing (not yet, at least), but I'm guessing that
pulling up the lock is still lighter weight than what a successful I/O
completion does.

That aside, I think we also want the entire sequence of marking
LI_FAILED to be atomic with respect to the AIL (e.g., so the next AIL
push sees the state of the items before the errors are processed or
after, not somewhere in between).
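A minimal sketch of that locking shape, again as a hypothetical userspace model (the lock, types, and names are stand-ins, not kernel code): the error callback cycles the AIL lock once around the whole item walk, so a concurrent AIL push observes either no failed items or all of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LI_FAILED	0x1

struct model_buf;

struct model_log_item {
	unsigned int		li_flags;
	struct model_buf	*li_buf;
	struct model_log_item	*li_bio_list;	/* next item on the buffer */
};

struct model_buf {
	int			b_hold;		/* stand-in hold count */
	struct model_log_item	*b_items;	/* items attached to the buffer */
};

static bool model_xa_locked;			/* stand-in for ailp->xa_lock */

static void model_xa_lock(void)   { assert(!model_xa_locked); model_xa_locked = true; }
static void model_xa_unlock(void) { assert(model_xa_locked);  model_xa_locked = false; }

/*
 * Error callback for a failed buffer write: one lock cycle covers the
 * whole item list instead of locking/unlocking once per item.
 */
static void
model_buf_iodone_error(struct model_buf *bp)
{
	struct model_log_item *lip;

	model_xa_lock();
	for (lip = bp->b_items; lip; lip = lip->li_bio_list) {
		if (lip->li_flags & MODEL_LI_FAILED)
			continue;		/* already holds the buffer */
		bp->b_hold++;
		lip->li_flags |= MODEL_LI_FAILED;
		lip->li_buf = bp;
	}
	model_xa_unlock();
}
```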

> > > +	xfs_buf_hold(bp);
> > > +	lip->li_flags |= XFS_LI_FAILED;
> > > +	lip->li_buf = bp;
> > 
> > Also, while I think this is technically Ok based on the current
> > implementation, I think we should also test that LI_FAILED is not
> > already set and only hold the buf in that case (i.e., the previously
> > mentioned helper). This helps prevent a landmine if the
> > implementation/locking/etc. changes down the road.
> > 
> > I think it would be fine to assert that LI_FAILED is not set already if
> > we wanted to, though, as long as there's a brief comment as well.
> > 
> > > +	spin_unlock(&lip->li_ailp->xa_lock);
> > > +}
> > > +
> > >  STATIC uint
> > >  xfs_inode_item_push(
> > >  	struct xfs_log_item	*lip,
> > > @@ -504,6 +525,18 @@ xfs_inode_item_push(
> > >  	}
> > >  
> > >  	/*
> > > +	 * The buffer containing this item failed to be written back
> > > +	 * previously. Resubmit the buffer for IO.
> > > +	 */
> > > +	if (lip->li_flags & XFS_LI_FAILED) {
> > > +		if (!xfs_buf_resubmit_failed_buffers(ip, lip,
> > > +						     buffer_list))
> > > +			rval = XFS_ITEM_FLUSHING;
> > > +
> > > +		goto out_unlock;
> > > +	}
> > 
> > I think this needs to go at least before the ilock. We don't need the
> > latter lock because the inode is already flushed. Somebody else could be
> > holding ilock and blocked on the flush lock. If that occurs, we'll never
> > make progress.
> 
> It should be after the ilock IIRC, having it before the ilock will deadlock it
> any unmount when there are still inodes to be destroyed.

Hmm.. so if we get into the failed state on an inode and a reclaim
attempt occurs on the inode, xfs_reclaim_inode() can grab the ilock and
then block on the flush lock. With this code as it is, we'll now never
resubmit the failed buffer (unless via some other inode, by chance)
because we can't get the inode lock. IOW, we're stuck in a livelock that
resembles the original problem.

AFAICT, we don't need the inode lock in this case at all. The only
reason for it here that I'm aware of is to follow the ilock -> flush
lock ordering rules and we know the inode is already flush locked in the
failed case. Can you elaborate on the unmount deadlock issue if we move
this before the ilock? Shouldn't the I/O error handling detect the
unmounting state the next go around and flag it as a permanent error?
I'm wondering if there is some other bug lurking there that we need to
work around...

(Note that I think it's important we resolve this above all else,
because getting this correct is probably what dictates whether this is a
viable approach or that we need to start looking at something like
Dave's flush lock rework idea.)

> > 
> > I also think we should lock/unlock the buffer here rather than down in
> > the helper. Hmm, maybe even queue it as well so it's clear what's going
> > on. For example, something like:
> > 
> > 	if (lip->li_flags & XFS_LI_FAILED) {
> > 		xfs_buf_lock(bp);
> > 		xfs_buf_clear_failed_items(bp);
> > 		if (!xfs_buf_delwri_queue(bp, buffer_list))
> > 			rval = XFS_ITEM_FLUSHING;
> > 		xfs_buf_unlock(bp);
> > 		goto out_unlock;
> > 	}
> > 
> 
> Well, we could do that, but we would need to duplicate this code between inode
> and dquot item, while keeping the helper as-is, we can re-use it in both inode
> and dquot.
> 

That doesn't seem so bad to me. It's basically just a lock and delwri
queue of a buffer. That said, we could also create a couple separate
helpers here: one that requeues a buffer and another that clears failed
items on a buffer. Hmm, I'm actually wondering if we can refactor
xfs_inode_item_push() a bit to just reuse some of the existing code
here. Perhaps it's best to get the locking down correctly before looking
into this...

BTW, I just realized that the xfs_buf_lock() above probably needs to be
a trylock (where we return XFS_ITEM_LOCKED on failure). The buffer lock
via xfs_iflush() is a trylock as well, and we actually drop and
reacquire ->xa_lock across that and the delwri queue of the buffer (but
I'm not totally sure that part is necessary here).
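Sketching that trylock shape in the same hypothetical userspace style (the lock here is a plain flag standing in for the buffer lock, and the return codes mirror the XFS_ITEM_* idea rather than the real values):

```c
#include <assert.h>
#include <stdbool.h>

enum model_push_ret {
	MODEL_ITEM_SUCCESS,
	MODEL_ITEM_LOCKED,	/* buffer lock contended, retry later */
	MODEL_ITEM_FLUSHING,	/* already queued elsewhere */
};

struct model_buf {
	bool	b_locked;	/* stand-in for the buffer lock */
	bool	b_on_delwri;	/* stand-in for delwri list membership */
};

static bool model_buf_trylock(struct model_buf *bp)
{
	if (bp->b_locked)
		return false;
	bp->b_locked = true;
	return true;
}

static void model_buf_unlock(struct model_buf *bp)
{
	bp->b_locked = false;
}

static bool model_delwri_queue(struct model_buf *bp)
{
	if (bp->b_on_delwri)
		return false;	/* already queued */
	bp->b_on_delwri = true;
	return true;
}

/*
 * Push-side handling of a failed buffer: never sleep on the buffer
 * lock; if it is contended, report the item locked and let the AIL
 * come back around later.
 */
static enum model_push_ret
model_push_failed(struct model_buf *bp)
{
	enum model_push_ret rval = MODEL_ITEM_SUCCESS;

	if (!model_buf_trylock(bp))
		return MODEL_ITEM_LOCKED;
	/* ...clear the failed items on the buffer here... */
	if (!model_delwri_queue(bp))
		rval = MODEL_ITEM_FLUSHING;
	model_buf_unlock(bp);
	return rval;
}
```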

> > ... but that is mostly just aesthetic.
> > 
> > > +
> > > +	/*
> > >  	 * Stale inode items should force out the iclog.
> > >  	 */
> > >  	if (ip->i_flags & XFS_ISTALE) {
> > > @@ -622,7 +655,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
> > >  	.iop_unlock	= xfs_inode_item_unlock,
> > >  	.iop_committed	= xfs_inode_item_committed,
> > >  	.iop_push	= xfs_inode_item_push,
> > > -	.iop_committing = xfs_inode_item_committing
> > > +	.iop_committing = xfs_inode_item_committing,
> > > +	.iop_error	= xfs_inode_item_error
> > >  };
> > >  
> > >  
> > > @@ -710,7 +744,8 @@ xfs_iflush_done(
> > >  		 * the AIL lock.
> > >  		 */
> > >  		iip = INODE_ITEM(blip);
> > > -		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
> > > +		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
> > > +		    lip->li_flags & XFS_LI_FAILED)
> > >  			need_ail++;
> > >  
> > >  		blip = next;
> > > @@ -718,7 +753,8 @@ xfs_iflush_done(
> > >  
> > >  	/* make sure we capture the state of the initial inode. */
> > >  	iip = INODE_ITEM(lip);
> > > -	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
> > > +	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
> > > +	    lip->li_flags & XFS_LI_FAILED)
> > >  		need_ail++;
> > >  
> > >  	/*
> > > @@ -739,6 +775,8 @@ xfs_iflush_done(
> > >  			if (INODE_ITEM(blip)->ili_logged &&
> > >  			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
> > >  				mlip_changed |= xfs_ail_delete_one(ailp, blip);
> > > +			else if (blip->li_flags & XFS_LI_FAILED)
> > > +				blip->li_flags &= ~XFS_LI_FAILED;
> > 
> > This needs to do the full "unfail" sequence as well (another place to
> > call the helper).
> > 
> > >  		}
> > >  
> > >  		if (mlip_changed) {
> > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > index 3486517..86c3464 100644
> > > --- a/fs/xfs/xfs_trans.h
> > > +++ b/fs/xfs/xfs_trans.h
> > > @@ -49,6 +49,7 @@ typedef struct xfs_log_item {
> > >  	struct xfs_ail			*li_ailp;	/* ptr to AIL */
> > >  	uint				li_type;	/* item type */
> > >  	uint				li_flags;	/* misc flags */
> > > +	struct xfs_buf			*li_buf;	/* real buffer pointer */
> > >  	struct xfs_log_item		*li_bio_list;	/* buffer item list */
> > >  	void				(*li_cb)(struct xfs_buf *,
> > >  						 struct xfs_log_item *);
> > > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > > index 9056c0f..36e08dd 100644
> > > --- a/fs/xfs/xfs_trans_ail.c
> > > +++ b/fs/xfs/xfs_trans_ail.c
> > > @@ -687,13 +687,13 @@ xfs_trans_ail_update_bulk(
> > >  bool
> > >  xfs_ail_delete_one(
> > >  	struct xfs_ail		*ailp,
> > > -	struct xfs_log_item 	*lip)
> > > +	struct xfs_log_item	*lip)
> > >  {
> > >  	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
> > >  
> > >  	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
> > >  	xfs_ail_delete(ailp, lip);
> > > -	lip->li_flags &= ~XFS_LI_IN_AIL;
> > > +	lip->li_flags &= ~(XFS_LI_IN_AIL | XFS_LI_FAILED);
> > 
> > ... and the same thing here.
> > 
> 
> I agree with the helpers, and I'll queue them up for V4, I'll just wait for some
> more reviews on this version before sending V4.
> 

Sounds good.

Brian

> Thanks for the review Brian.
> 
> Cheers
> 
> -- 
> Carlos
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Carlos Maiolino June 14, 2017, 12:29 p.m. UTC | #4
On Mon, Jun 12, 2017 at 09:45:44AM -0400, Brian Foster wrote:
> On Mon, Jun 12, 2017 at 02:44:23PM +0200, Carlos Maiolino wrote:
> > Hey.
> > 
> > > > +/*
> > > > + * Requeue a failed buffer for writeback
> > > > + *
> > > > + * Return true if the buffer has been re-queued properly, false otherwise
> > > > + */
> > > > +bool
> > > > +xfs_buf_resubmit_failed_buffers(
> > > > +	struct xfs_inode	*ip,
> > > > +	struct xfs_log_item	*lip,
> > > > +	struct list_head	*buffer_list)
> > > > +{
> > > > +	struct xfs_log_item	*next;
> > > > +	struct xfs_buf		*bp;
> > > > +	bool			ret;
> > > > +
> > > > +	bp = lip->li_buf;
> > > > +
> > > > +	/* Clear the XFS_LI_FAILED flag from all items before resubmitting.
> > > > +	 *
> > > > +	 * XFS_LI_FAILED set/clear is protected by xa_lock; the caller of
> > > > +	 * this function must already hold it.
> > > > +	 */
> > > > +	for (; lip; lip = next) {
> > > > +		next = lip->li_bio_list;
> > > > +		lip->li_flags &= ~XFS_LI_FAILED;
> > > > +		xfs_buf_rele(bp);
> > > 
> > > We need to check that the log item has LI_FAILED set before we release
> > > the buffer because it's possible for the buffer to have a mix of failed
> > > and recently flushed (not failed) items. We should set ->li_buf = NULL
> > > wherever we clear LI_FAILED as well.
> > > 
> > > A couple small inline helpers to fail/unfail log items correctly (i.e.,
> > > xfs_log_item_[fail|clear]() or some such) might help to make sure we get
> > > the reference counting right. We could also assert that the correct lock
> > > is held from those helpers.
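The fail/unfail helpers being proposed here could be sketched roughly as below. This is a userspace mock, not kernel code: the structures are pared down to the two relevant fields, the hold count is a plain int standing in for xfs_buf_hold()/xfs_buf_rele(), and the real helpers would also assert that ->xa_lock is held. Only the reference-counting discipline (hold on the 0->1 failed transition, release and NULL ->li_buf together on clear) is the point:

```c
#include <assert.h>
#include <stddef.h>

#define XFS_LI_FAILED	0x100	/* hypothetical flag value */

/* pared-down stand-ins for the kernel structures */
struct xfs_buf {
	int		b_hold;		/* mock reference count */
};

struct xfs_log_item {
	unsigned int	li_flags;
	struct xfs_buf	*li_buf;
};

/*
 * Mark a log item failed, taking a buffer reference only on the
 * first failure so repeated failures don't leak references.
 */
static void
xfs_log_item_fail(struct xfs_log_item *lip, struct xfs_buf *bp)
{
	if (!(lip->li_flags & XFS_LI_FAILED)) {
		bp->b_hold++;			/* xfs_buf_hold(bp) */
		lip->li_flags |= XFS_LI_FAILED;
		lip->li_buf = bp;
	}
}

/*
 * Clear the failed state, dropping the buffer reference and the
 * stale ->li_buf pointer together so they can't get out of sync.
 */
static void
xfs_log_item_clear_failed(struct xfs_log_item *lip)
{
	if (lip->li_flags & XFS_LI_FAILED) {
		lip->li_flags &= ~XFS_LI_FAILED;
		lip->li_buf->b_hold--;		/* xfs_buf_rele(bp) */
		lip->li_buf = NULL;
	}
}
```

Because both helpers check the flag first, they are idempotent, which is what makes them safe to call from the several places in the thread that may race to fail or unfail the same item.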
> > > 
> > 
> > Agreed, will queue it for V4
> > 
> > > > +	}
> > > > +
> > > > +	/* Add this buffer back to the delayed write list */
> > > > +	xfs_buf_lock(bp);
> > > > +	if (!xfs_buf_delwri_queue(bp, buffer_list))
> > > > +		ret = false;
> > > > +	else
> > > > +		ret = true;
> > > > +
> > > > +	xfs_buf_unlock(bp);
> > > > +	return ret;
> > > > +}
> > > > diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> > > > index f7eba99..32aceae 100644
> > > > --- a/fs/xfs/xfs_buf_item.h
> > > > +++ b/fs/xfs/xfs_buf_item.h
> > > > @@ -70,6 +70,8 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
> > > >  			      xfs_log_item_t *);
> > > >  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
> > > >  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
> > > > +bool	xfs_buf_resubmit_failed_buffers(struct xfs_inode *, struct xfs_log_item *,
> > > > +					struct list_head *);
> > > >  
> > > >  extern kmem_zone_t	*xfs_buf_item_zone;
> > > >  
> > > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > > > index 08cb7d1..2f5a49e 100644
> > > > --- a/fs/xfs/xfs_inode_item.c
> > > > +++ b/fs/xfs/xfs_inode_item.c
> > > > @@ -27,6 +27,7 @@
> > > >  #include "xfs_error.h"
> > > >  #include "xfs_trace.h"
> > > >  #include "xfs_trans_priv.h"
> > > > +#include "xfs_buf_item.h"
> > > >  #include "xfs_log.h"
> > > >  
> > > >  
> > > > @@ -475,6 +476,26 @@ xfs_inode_item_unpin(
> > > >  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> > > >  }
> > > >  
> > > > +STATIC void
> > > > +xfs_inode_item_error(
> > > > +	struct xfs_log_item	*lip,
> > > > +	struct xfs_buf		*bp,
> > > > +	unsigned int		bflags)
> > > > +{
> > > > +
> > > > +	/*
> > > > +	 * Writeback of the buffer containing this inode failed. Mark
> > > > +	 * the item as failed and release the flush lock so it can be
> > > > +	 * retried later.
> > > > +	 * Use xa_lock to avoid races with other log item state changes.
> > > > +	 */
> > > > +	spin_lock(&lip->li_ailp->xa_lock);
> > > 
> > > I'd push the ->xa_lock up into caller and cycle it once over the list
> > > rather than once per item. This is more consistent with the locking
> > > pattern for successful I/Os.
> > > 
> > 
> > I considered it too, and decided to use spin_lock/unlock on each iteration to
> > avoid having it locked for a long time, but I don't think there is any harm in
> > using a single lock/unlock.
> > 
> 
> I don't think we're too concerned about performance at this low a level
> when I/Os are failing (not yet, at least), but I'm guessing that
> pulling up the lock is still lighter weight than what a successful I/O
> completion does.
> 
> That aside, I think we also want the entire sequence of marking
> LI_FAILED to be atomic with respect to the AIL (e.g., so the next AIL
> push sees the state of the items before the errors are processed or
> after, not somewhere in between).
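The atomicity being asked for here — every item attached to the failed buffer flipping to the failed state under a single ->xa_lock cycle, so an AIL push never observes a half-processed error — could look roughly like the following. Again a userspace mock under stated assumptions: the spinlock is reduced to a cycle counter, the structures are minimal stand-ins, and the flag set would really go through the fail helper with a buffer hold:

```c
#include <assert.h>
#include <stddef.h>

#define XFS_LI_FAILED	0x100	/* hypothetical flag value */

/* minimal mocks: count lock round-trips instead of really locking */
struct xfs_ail {
	int			xa_lock_cycles;
};

struct xfs_log_item {
	unsigned int		li_flags;
	struct xfs_log_item	*li_bio_list;	/* buffer item list */
	struct xfs_ail		*li_ailp;
};

static void spin_lock(struct xfs_ail *ailp)   { ailp->xa_lock_cycles++; }
static void spin_unlock(struct xfs_ail *ailp) { (void)ailp; }

/*
 * Walk the whole item list attached to a failed buffer under one
 * xa_lock cycle, so the next AIL push sees either no items failed
 * or all of them -- never an in-between state.
 */
static void
xfs_buf_fail_items(struct xfs_log_item *lip)
{
	struct xfs_ail		*ailp = lip->li_ailp;
	struct xfs_log_item	*next;

	spin_lock(ailp);
	for (; lip; lip = next) {
		next = lip->li_bio_list;
		lip->li_flags |= XFS_LI_FAILED;	/* real code: fail helper + buf hold */
	}
	spin_unlock(ailp);
}
```

Compared with taking and dropping the lock once per item, this also means anyone scanning the AIL concurrently serializes against the whole error-processing pass rather than interleaving with it.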
> 

Agreed.

> > > > +	xfs_buf_hold(bp);
> > > > +	lip->li_flags |= XFS_LI_FAILED;
> > > > +	lip->li_buf = bp;
> > > 
> > > Also, while I think this is technically Ok based on the current
> > > implementation, I think we should also test that LI_FAILED is not
> > > already set and only hold the buf in that case (i.e., the previously
> > > mentioned helper). This helps prevent a landmine if the
> > > implementation/locking/etc. changes down the road.
> > > 
> > > I think it would be fine to assert that LI_FAILED is not set already if
> > > we wanted to, though, as long as there's a brief comment as well.
> > > 
> > > > +	spin_unlock(&lip->li_ailp->xa_lock);
> > > > +}
> > > > +
> > > >  STATIC uint
> > > >  xfs_inode_item_push(
> > > >  	struct xfs_log_item	*lip,
> > > > @@ -504,6 +525,18 @@ xfs_inode_item_push(
> > > >  	}
> > > >  
> > > >  	/*
> > > > +	 * The buffer containing this item failed to be written back
> > > > +	 * previously. Resubmit the buffer for IO.
> > > > +	 */
> > > > +	if (lip->li_flags & XFS_LI_FAILED) {
> > > > +		if (!xfs_buf_resubmit_failed_buffers(ip, lip,
> > > > +						     buffer_list))
> > > > +			rval = XFS_ITEM_FLUSHING;
> > > > +
> > > > +		goto out_unlock;
> > > > +	}
> > > 
> > > I think this needs to go at least before the ilock. We don't need the
> > > latter lock because the inode is already flushed. Somebody else could be
> > > holding ilock and blocked on the flush lock. If that occurs, we'll never
> > > make progress.
> > 
> > It should be after the ilock IIRC; having it before the ilock will deadlock
> > any unmount when there are still inodes to be destroyed.
> 
> Hmm.. so if we get into the failed state on an inode and a reclaim
> attempt occurs on the inode, xfs_reclaim_inode() can grab the ilock and
> then block on the flush lock. With this code as it is, we'll now never
> resubmit the failed buffer (unless via some other inode, by chance)
> because we can't get the inode lock. IOW, we're stuck in a livelock that
> resembles the original problem.
> 
> AFAICT, we don't need the inode lock in this case at all. The only
> reason for it here that I'm aware of is to follow the ilock -> flush
> lock ordering rules and we know the inode is already flush locked in the
> failed case. Can you elaborate on the unmount deadlock issue if we move
> this before the ilock? Shouldn't the I/O error handling detect the
> unmounting state the next go around and flag it as a permanent error?
> I'm wondering if there is some other bug lurking there that we need to
> work around...
> 

Never mind, I hadn't investigated when I hit the deadlock. I've now found what
happened, fixed it, and I'll queue this for V4.

> (Note that I think it's important we resolve this above all else,
> because getting this correct is probably what dictates whether this is a
> viable approach or that we need to start looking at something like
> Dave's flush lock rework idea.)
> 
> > > 
> > > I also think we should lock/unlock the buffer here rather than down in
> > > the helper. Hmm, maybe even queue it as well so it's clear what's going
> > > on. For example, something like:
> > > 
> > > 	if (lip->li_flags & XFS_LI_FAILED) {
> > > 		xfs_buf_lock(bp);
> > > 		xfs_buf_clear_failed_items(bp);
> > > 		if (!xfs_buf_delwri_queue(bp, buffer_list))
> > > 			rval = XFS_ITEM_FLUSHING;
> > > 		xfs_buf_unlock(bp);
> > > 		goto out_unlock;
> > > 	}
> > > 
> > 
> > Well, we could do that, but we would need to duplicate this code between inode
> > and dquot item, while keeping the helper as-is, we can re-use it in both inode
> > and dquot.
> > 
> 
> That doesn't seem so bad to me. It's basically just a lock and delwri
> queue of a buffer. That said, we could also create a couple separate
> helpers here: one that requeues a buffer and another that clears failed
> items on a buffer. Hmm, I'm actually wondering if we can refactor
> xfs_inode_item_push() a bit to just reuse some of the existing code
> here. Perhaps it's best to get the locking down correctly before looking
> into this...
> 
> BTW, I just realized that the xfs_buf_lock() above probably needs to be
> a trylock (where we return XFS_ITEM_LOCKED on failure). The buffer lock
> via xfs_iflush() is a trylock as well, and we actually drop and
> reacquire ->xa_lock across that and the delwri queue of the buffer (but
> I'm not totally sure that part is necessary here).
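The trylock variant being suggested for the failed-item branch of ->iop_push() could be sketched like this. Everything here is a userspace mock: the buffer "lock" is a plain bool, the XFS_ITEM_* values only mirror the real return codes, and the delwri-queue stand-in is assumed to succeed — the point is just the back-off behaviour instead of sleeping on a buffer someone else holds:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* hypothetical return codes mirroring the xfsaild push results */
#define XFS_ITEM_SUCCESS	0
#define XFS_ITEM_LOCKED		1
#define XFS_ITEM_FLUSHING	2

struct xfs_buf {
	bool	b_locked;	/* mock of the buffer semaphore */
};

static bool xfs_buf_trylock(struct xfs_buf *bp)
{
	if (bp->b_locked)
		return false;
	bp->b_locked = true;
	return true;
}

static void xfs_buf_unlock(struct xfs_buf *bp) { bp->b_locked = false; }

/* stand-in for xfs_buf_delwri_queue(); assume queueing succeeds */
static bool xfs_buf_delwri_queue(struct xfs_buf *bp)
{
	(void)bp;
	return true;
}

/*
 * Failed-item branch of the push: try the buffer lock and back off
 * with XFS_ITEM_LOCKED on contention (matching how xfs_iflush()
 * takes the buffer lock), rather than blocking xfsaild.
 */
static int
xfs_item_push_failed(struct xfs_buf *bp)
{
	int	rval = XFS_ITEM_SUCCESS;

	if (!xfs_buf_trylock(bp))
		return XFS_ITEM_LOCKED;

	if (!xfs_buf_delwri_queue(bp))
		rval = XFS_ITEM_FLUSHING;
	xfs_buf_unlock(bp);
	return rval;
}
```

Returning XFS_ITEM_LOCKED lets xfsaild move on and retry the item on a later push, which is exactly the behaviour wanted when another thread transiently holds the buffer.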
> 
> > > ... but that is mostly just aesthetic.
> > > 
> > > > +
> > > > +	/*
> > > >  	 * Stale inode items should force out the iclog.
> > > >  	 */
> > > >  	if (ip->i_flags & XFS_ISTALE) {
> > > > @@ -622,7 +655,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
> > > >  	.iop_unlock	= xfs_inode_item_unlock,
> > > >  	.iop_committed	= xfs_inode_item_committed,
> > > >  	.iop_push	= xfs_inode_item_push,
> > > > -	.iop_committing = xfs_inode_item_committing
> > > > +	.iop_committing = xfs_inode_item_committing,
> > > > +	.iop_error	= xfs_inode_item_error
> > > >  };
> > > >  
> > > >  
> > > > @@ -710,7 +744,8 @@ xfs_iflush_done(
> > > >  		 * the AIL lock.
> > > >  		 */
> > > >  		iip = INODE_ITEM(blip);
> > > > -		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
> > > > +		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
> > > > +		    lip->li_flags & XFS_LI_FAILED)
> > > >  			need_ail++;
> > > >  
> > > >  		blip = next;
> > > > @@ -718,7 +753,8 @@ xfs_iflush_done(
> > > >  
> > > >  	/* make sure we capture the state of the initial inode. */
> > > >  	iip = INODE_ITEM(lip);
> > > > -	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
> > > > +	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
> > > > +	    lip->li_flags & XFS_LI_FAILED)
> > > >  		need_ail++;
> > > >  
> > > >  	/*
> > > > @@ -739,6 +775,8 @@ xfs_iflush_done(
> > > >  			if (INODE_ITEM(blip)->ili_logged &&
> > > >  			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
> > > >  				mlip_changed |= xfs_ail_delete_one(ailp, blip);
> > > > +			else if (blip->li_flags & XFS_LI_FAILED)
> > > > +				blip->li_flags &= ~XFS_LI_FAILED;
> > > 
> > > This needs to do the full "unfail" sequence as well (another place to
> > > call the helper).
> > > 
> > > >  		}
> > > >  
> > > >  		if (mlip_changed) {
> > > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > > index 3486517..86c3464 100644
> > > > --- a/fs/xfs/xfs_trans.h
> > > > +++ b/fs/xfs/xfs_trans.h
> > > > @@ -49,6 +49,7 @@ typedef struct xfs_log_item {
> > > >  	struct xfs_ail			*li_ailp;	/* ptr to AIL */
> > > >  	uint				li_type;	/* item type */
> > > >  	uint				li_flags;	/* misc flags */
> > > > +	struct xfs_buf			*li_buf;	/* real buffer pointer */
> > > >  	struct xfs_log_item		*li_bio_list;	/* buffer item list */
> > > >  	void				(*li_cb)(struct xfs_buf *,
> > > >  						 struct xfs_log_item *);
> > > > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > > > index 9056c0f..36e08dd 100644
> > > > --- a/fs/xfs/xfs_trans_ail.c
> > > > +++ b/fs/xfs/xfs_trans_ail.c
> > > > @@ -687,13 +687,13 @@ xfs_trans_ail_update_bulk(
> > > >  bool
> > > >  xfs_ail_delete_one(
> > > >  	struct xfs_ail		*ailp,
> > > > -	struct xfs_log_item 	*lip)
> > > > +	struct xfs_log_item	*lip)
> > > >  {
> > > >  	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
> > > >  
> > > >  	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
> > > >  	xfs_ail_delete(ailp, lip);
> > > > -	lip->li_flags &= ~XFS_LI_IN_AIL;
> > > > +	lip->li_flags &= ~(XFS_LI_IN_AIL | XFS_LI_FAILED);
> > > 
> > > ... and the same thing here.
> > > 
> > 
> > I agree with the helpers, and I'll queue them up for V4, I'll just wait for some
> > more reviews on this version before sending V4.
> > 
> 
> Sounds good.
> 
> Brian
> 
> > Thanks for the review Brian.
> > 
> > Cheers
> > 
> > -- 
> > Carlos

Patch

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 523b7a4..9b7bd23 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1222,3 +1222,42 @@  xfs_buf_iodone(
 	xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
 	xfs_buf_item_free(BUF_ITEM(lip));
 }
+
+/*
+ * Requeue a failed buffer for writeback
+ *
+ * Return true if the buffer has been re-queued properly, false otherwise
+ */
+bool
+xfs_buf_resubmit_failed_buffers(
+	struct xfs_inode	*ip,
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	struct xfs_log_item	*next;
+	struct xfs_buf		*bp;
+	bool			ret;
+
+	bp = lip->li_buf;
+
+	/* Clear the XFS_LI_FAILED flag from all items before resubmitting.
+	 *
+	 * XFS_LI_FAILED set/clear is protected by xa_lock; the caller of
+	 * this function must already hold it.
+	 */
+	for (; lip; lip = next) {
+		next = lip->li_bio_list;
+		lip->li_flags &= ~XFS_LI_FAILED;
+		xfs_buf_rele(bp);
+	}
+
+	/* Add this buffer back to the delayed write list */
+	xfs_buf_lock(bp);
+	if (!xfs_buf_delwri_queue(bp, buffer_list))
+		ret = false;
+	else
+		ret = true;
+
+	xfs_buf_unlock(bp);
+	return ret;
+}
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index f7eba99..32aceae 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -70,6 +70,8 @@  void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      xfs_log_item_t *);
 void	xfs_buf_iodone_callbacks(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
+bool	xfs_buf_resubmit_failed_buffers(struct xfs_inode *, struct xfs_log_item *,
+					struct list_head *);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 08cb7d1..2f5a49e 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -27,6 +27,7 @@ 
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_trans_priv.h"
+#include "xfs_buf_item.h"
 #include "xfs_log.h"
 
 
@@ -475,6 +476,26 @@  xfs_inode_item_unpin(
 		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
 }
 
+STATIC void
+xfs_inode_item_error(
+	struct xfs_log_item	*lip,
+	struct xfs_buf		*bp,
+	unsigned int		bflags)
+{
+
+	/*
+	 * Writeback of the buffer containing this inode failed. Mark
+	 * the item as failed and release the flush lock so it can be
+	 * retried later.
+	 * Use xa_lock to avoid races with other log item state changes.
+	 */
+	spin_lock(&lip->li_ailp->xa_lock);
+	xfs_buf_hold(bp);
+	lip->li_flags |= XFS_LI_FAILED;
+	lip->li_buf = bp;
+	spin_unlock(&lip->li_ailp->xa_lock);
+}
+
 STATIC uint
 xfs_inode_item_push(
 	struct xfs_log_item	*lip,
@@ -504,6 +525,18 @@  xfs_inode_item_push(
 	}
 
 	/*
+	 * The buffer containing this item failed to be written back
+	 * previously. Resubmit the buffer for IO.
+	 */
+	if (lip->li_flags & XFS_LI_FAILED) {
+		if (!xfs_buf_resubmit_failed_buffers(ip, lip,
+						     buffer_list))
+			rval = XFS_ITEM_FLUSHING;
+
+		goto out_unlock;
+	}
+
+	/*
 	 * Stale inode items should force out the iclog.
 	 */
 	if (ip->i_flags & XFS_ISTALE) {
@@ -622,7 +655,8 @@  static const struct xfs_item_ops xfs_inode_item_ops = {
 	.iop_unlock	= xfs_inode_item_unlock,
 	.iop_committed	= xfs_inode_item_committed,
 	.iop_push	= xfs_inode_item_push,
-	.iop_committing = xfs_inode_item_committing
+	.iop_committing = xfs_inode_item_committing,
+	.iop_error	= xfs_inode_item_error
 };
 
 
@@ -710,7 +744,8 @@  xfs_iflush_done(
 		 * the AIL lock.
 		 */
 		iip = INODE_ITEM(blip);
-		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
+		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
+		    lip->li_flags & XFS_LI_FAILED)
 			need_ail++;
 
 		blip = next;
@@ -718,7 +753,8 @@  xfs_iflush_done(
 
 	/* make sure we capture the state of the initial inode. */
 	iip = INODE_ITEM(lip);
-	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
+	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
+	    lip->li_flags & XFS_LI_FAILED)
 		need_ail++;
 
 	/*
@@ -739,6 +775,8 @@  xfs_iflush_done(
 			if (INODE_ITEM(blip)->ili_logged &&
 			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
 				mlip_changed |= xfs_ail_delete_one(ailp, blip);
+			else if (blip->li_flags & XFS_LI_FAILED)
+				blip->li_flags &= ~XFS_LI_FAILED;
 		}
 
 		if (mlip_changed) {
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 3486517..86c3464 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -49,6 +49,7 @@  typedef struct xfs_log_item {
 	struct xfs_ail			*li_ailp;	/* ptr to AIL */
 	uint				li_type;	/* item type */
 	uint				li_flags;	/* misc flags */
+	struct xfs_buf			*li_buf;	/* real buffer pointer */
 	struct xfs_log_item		*li_bio_list;	/* buffer item list */
 	void				(*li_cb)(struct xfs_buf *,
 						 struct xfs_log_item *);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 9056c0f..36e08dd 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -687,13 +687,13 @@  xfs_trans_ail_update_bulk(
 bool
 xfs_ail_delete_one(
 	struct xfs_ail		*ailp,
-	struct xfs_log_item 	*lip)
+	struct xfs_log_item	*lip)
 {
 	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
 
 	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
 	xfs_ail_delete(ailp, lip);
-	lip->li_flags &= ~XFS_LI_IN_AIL;
+	lip->li_flags &= ~(XFS_LI_IN_AIL | XFS_LI_FAILED);
 	lip->li_lsn = 0;
 
 	return mlip == lip;